1)Problem statement explanation:-¶

Customers with high retention potential are more likely to continue doing business with a company, purchase more products or services, and recommend the company to others. They tend to have a positive perception of the company, trust its brand, and feel satisfied with their overall experience. On the other hand, customers with low retention potential are at risk of churning, i.e. ending their business relationship with the company. They may have had a negative experience with the company or its products, feel dissatisfied, or perceive a lack of value in their interactions with the company. By identifying and analyzing the retention potential of its customers, a company can develop strategies to improve customer loyalty and satisfaction, increase customer lifetime value, and reduce churn rates.

Attributes present in our data:-


1)CustomerID:-A unique identifier assigned to each customer.


2)City:-The city the customer belongs to.


3)DateOfRegistration:-The date on which the customer registered.


4)OnlineCommunication:-Whether the customer prefers online communication or not.


5)AutomaticRefill:-Whether the customer wants their account refilled automatically when a term ends.


6)DoorstepDelivery:-Whether the customer prefers doorstep delivery or not.


7)PreferredDeliveryDay:-The day of the week on which the customer prefers delivery.


8)RetentionPotential:-A measure of the customer's loyalty and satisfaction with the company's products or services. This is the target variable.


9)DateOfOrder:-The date on which the customer placed an order with the company.


10)timestamp:-The time of day at which the order was placed (hour, minute, and second).


11)OrderQuantity:-The number of units of a product that the customer ordered; recorded in the company's order management system and used for tracking customer orders.


12)DateOfemail:-The date on which an email was sent to the customer; recorded by the email service and used for tracking email communications.


13)EmailType:-The category of the email message based on its content or purpose (here, WelcomeEmail or CustomizedEmail).


14)timestamp:-The time of day at which the email was sent (hour, minute, and second).


15)MailOpened:-Whether the recipient opened the email; commonly used in email marketing to measure campaign effectiveness and recipient behavior.


16)MailClicked:-Whether the recipient clicked a link in the email; together with MailOpened, it measures recipient engagement with email campaigns.


a)Importing libraries¶

In [1]:
#Import the required libraries.
import warnings
warnings.filterwarnings("ignore")

import math

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#chi-square test of independence
from scipy.stats import chi2_contingency
from scipy.stats import chi2

from IPython.display import Image
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, classification_report, recall_score,
                             precision_score, confusion_matrix, f1_score)
import graphviz

## conda install -c anaconda statsmodels
import statsmodels.api as sm

get_ipython().run_line_magic('matplotlib', 'inline')
pd.set_option('display.max_columns', 125)
pd.set_option('display.max_rows', 125)

2) Merging of files¶

In [2]:
customer=pd.read_excel("/content/Customerdata_Train.xlsx")
train=pd.read_excel("/content/Train.xlsx")
transaction=pd.read_excel("/content/transactionsdata_Train (1).xlsx")
email=pd.read_excel("/content/Emaildata_Train (1).xlsx")
In [3]:
transaction.columns
Out[3]:
Index(['CustomerID', 'DateOfOrder', 'timestamp', 'OrderQuantity'], dtype='object')
In [4]:
#checking all the datasets shape to get a proper understanding
print("shape of customer",customer.shape)
print("shape of train",train.shape)
print("shape of transaction",transaction.shape)
print("shape of email",email.shape)
shape of customer (23820, 7)
shape of train (23820, 2)
shape of transaction (104755, 4)
shape of email (517672, 6)

From this we understand that our target column has 23820 rows, so we need to filter each dataset down to 23820 unique values in the CustomerID column. We will go through all four datasets and check what to filter.
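The reduction we are after can be sketched on toy data (the `tx` frame here is hypothetical): a table with repeated CustomerIDs is collapsed to one row per customer by a per-customer aggregation.

```python
import pandas as pd

# Hypothetical mini-version of the transaction table: C1 appears twice, C2 once.
tx = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2"],
    "OrderQuantity": [10, 20, 5],
})

# Before filtering the IDs repeat; after aggregating there is one row per ID.
assert tx["CustomerID"].nunique() == 2
per_customer = tx.groupby("CustomerID", as_index=False)["OrderQuantity"].sum()
print(per_customer.shape)  # (2, 2)
```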

Let's start with the customer data.

In [5]:
#customer.head()
customer["CustomerID"].nunique()
#checking unique values for all the customer ids
Out[5]:
23820
In [6]:
customer.head()
Out[6]:
CustomerID City DateOfRegistration OnlineCommunication AutomaticRefill DoorstepDelivery PreferredDeliveryDay
0 C16368 CITY2 2012-09-28 0 0 0 Monday
1 C20345 CITY2 2010-12-19 1 1 1 Friday
2 C17683 CITY2 2010-10-03 0 0 0 Wednesday
3 C18647 CITY4 2010-11-27 0 0 0 Monday
4 C36692 CITY2 2009-12-04 1 1 0 Friday

train

In [7]:
#train.head()
train["CustomerID"].nunique()
#checking unique values for all the customer ids
Out[7]:
23820

transaction

In [8]:
#transaction['DateOfOrder'] = pd.to_datetime(transaction['DateOfOrder'])
transaction["CustomerID"].nunique()
#checking unique values for all the customer ids
Out[8]:
23820
In [9]:
# combine the 'DateOfOrder' and 'timestamp' columns into a single datetime column
transaction['last_transaction_date'] = pd.to_datetime(transaction['DateOfOrder'] +' '+transaction['timestamp'])

a)visualizing sales trends¶

In [10]:
monthly_orders = transaction.groupby(pd.Grouper(key='last_transaction_date', freq='M')).agg({'OrderQuantity': 'sum'})
# Create a subplot with 1 row and 2 columns
fig, axs = plt.subplots(1, 2, figsize=(14, 2))

# Plot monthly order quantities as a line chart
axs[0].plot(monthly_orders.index, monthly_orders['OrderQuantity'], color='blue')
axs[0].set_title('Monthly Order Quantities-lineplot')
axs[0].set_xlabel('Date')
axs[0].set_ylabel('Order Quantity')

# Plot yearly order quantities as a bar chart
axs[1].bar(monthly_orders.index.year, monthly_orders['OrderQuantity'], color='green')
axs[1].set_title('Monthly Order Quantities-barplot')
axs[1].set_xlabel('Year')
axs[1].set_ylabel('Order Quantity')

# Display the plot
plt.show()

**Key point: monthly orders were highest during the 2013-2014 period.**

In [11]:
yearly_orders = transaction.groupby(pd.Grouper(key='last_transaction_date', freq='Y')).agg({'OrderQuantity': 'sum'})
# Create a subplot with 1 row and 2 columns
fig, axs = plt.subplots(1, 2, figsize=(14, 2))

# Plot yearly order quantities as a line chart
axs[0].plot(yearly_orders.index, yearly_orders['OrderQuantity'], color='blue')
axs[0].set_title('yearly Order Quantities-lineplot')
axs[0].set_xlabel('Date')
axs[0].set_ylabel('Order Quantity')

# Plot yearly order quantities as a bar chart
axs[1].bar(yearly_orders.index.year, yearly_orders['OrderQuantity'], color='green')
axs[1].set_title('yearly Order Quantities-barplot')
axs[1].set_xlabel('Year')
axs[1].set_ylabel('Order Quantity')

# Display the plot
plt.show()

If we observe the yearly totals, order quantities were highest during 2011-2014.

Customer retention: By analyzing the average time between purchases, businesses can determine if their customers are coming back to make repeat purchases or if they are churning. If the average time between purchases is increasing over time, it may indicate that customers are losing interest in the business or finding better alternatives.

In [12]:
transaction['DateOfOrder']=pd.to_datetime(transaction['DateOfOrder'])
# Sort the dataset by customer ID and purchase date
df3 = transaction.sort_values(['CustomerID', 'DateOfOrder'])

# Calculate the time between purchases for each customer
df3['time_between_purchases'] = df3.groupby('CustomerID')['DateOfOrder'].diff().dt.days.fillna(0)
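As a quick sanity check on the gap logic, here is a toy run with known gaps (hypothetical data): `diff()` yields NaT for each customer's first order, which `fillna(0)` turns into a zero-day gap.

```python
import pandas as pd

# Hypothetical purchase dates: C1 orders on day 0, 10, and 30; C2 orders once.
df = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C1", "C2"],
    "DateOfOrder": pd.to_datetime(
        ["2013-01-01", "2013-01-11", "2013-01-31", "2013-05-01"]),
}).sort_values(["CustomerID", "DateOfOrder"])

# Days since the same customer's previous order; first order -> 0.
df["gap_days"] = df.groupby("CustomerID")["DateOfOrder"].diff().dt.days.fillna(0)

avg_gap = df.groupby("CustomerID")["gap_days"].mean()
print(avg_gap["C1"])  # (0 + 10 + 20) / 3 = 10.0
```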
In [13]:
df3['order_month']=pd.to_datetime(df3['DateOfOrder']).dt.month
In [14]:
# group the data by customer ID and month, and count the number of orders in each group
order_counts = df3.groupby(['CustomerID', 'order_month'])['OrderQuantity'].sum()
# use the unstack method to pivot the table so that the rows represent customers and the columns represent months
order_counts = order_counts.unstack(level=-1, fill_value=0)
# fill any missing values with 0
order_counts = order_counts.fillna(0)
order_counts.columns=["Totalorders-jan","Totalorders-feb","Totalorders-march","Totalorders-april","Totalorders-may","Totalorders-june","Totalorders-july","Totalorders-august","Totalorders-september","Totalorders-october","Totalorders-november","Totalorders-december"]
order_counts=order_counts.reset_index(drop=False)
In [15]:
# Group the data by customer ID and calculate the number of transactions, total amount spent, and last transaction date
grouped_data = df3.groupby('CustomerID').agg({
    'CustomerID': 'count', # Number of transactions
    'OrderQuantity': ['sum','min','max'],# Total orders
    'last_transaction_date': ['min', 'max'],# first transaction date,Last transaction date
    'time_between_purchases':'mean' # average time between purchases
    
})
# Rename the columns to be more descriptive
grouped_data.columns = ['num_transactions','totalorders','minimumorderquantity','maximumorderquantity','first_transaction_date','last_transaction_date','average_time_between_purchase']
In [16]:
grouped_data['DateOfOrder'] = grouped_data['last_transaction_date'].dt.date.astype(str)
grouped_data['timestamp'] = grouped_data['last_transaction_date'].dt.time.apply(lambda x: x.strftime('%H:%M:%S'))
grouped_data["averageorders"]=grouped_data["totalorders"]//grouped_data["num_transactions"]
grouped_data = grouped_data.reset_index(drop=False)
In [17]:
# merge the grouped DataFrame back with the main DataFrame
transaction_data = pd.merge(df3, grouped_data, on=['CustomerID','last_transaction_date'],how="inner")
In [18]:
transaction_data = pd.merge(transaction_data, order_counts, on=['CustomerID'],how="inner")
In [19]:
transaction['last_transaction_date']
transaction = transaction.rename(columns={'last_transaction_date': 'transaction_date'})
In [20]:
#transaction_data
In [21]:
transaction_data.shape
#rechecking for incorrect dtypes
#transaction_data["CustomerID"].nunique()
#rechecking for unique customerids
Out[21]:
(23820, 28)
In [22]:
transaction_data=transaction_data.drop(['DateOfOrder_x', 'timestamp_x','DateOfOrder_y', 'timestamp_y','time_between_purchases','order_month'],axis=1)
In [23]:
transaction_data.head()
Out[23]:
CustomerID OrderQuantity last_transaction_date num_transactions totalorders minimumorderquantity maximumorderquantity first_transaction_date average_time_between_purchase averageorders Totalorders-jan Totalorders-feb Totalorders-march Totalorders-april Totalorders-may Totalorders-june Totalorders-july Totalorders-august Totalorders-september Totalorders-october Totalorders-november Totalorders-december
0 C12116 33 2013-10-10 19:50:02 3 115 33 44 2013-10-03 13:49:08 2.333333 38 0 0 0 0 0 0 0 0 0 115 0 0
1 C12117 41 2013-12-05 18:47:14 1 41 41 41 2013-12-05 18:47:14 0.000000 41 0 0 0 0 0 0 0 0 0 0 0 41
2 C12118 41 2013-09-10 10:08:42 1 41 41 41 2013-09-10 10:08:42 0.000000 41 0 0 0 0 0 0 0 0 41 0 0 0
3 C12119 52 2013-10-22 20:48:00 1 52 52 52 2013-10-22 20:48:00 0.000000 52 0 0 0 0 0 0 0 0 0 52 0 0
4 C12120 31 2014-01-10 13:24:12 1 31 31 31 2014-01-10 13:24:12 0.000000 31 31 0 0 0 0 0 0 0 0 0 0 0

for email

In [24]:
email.head()
Out[24]:
CustomerID DateOfemail EmailType timestamp MailOpened MailClicked
0 C16368 2012-09-28 WelcomeEmail 15:56:50 yes NotApplicable
1 C20345 2010-12-19 WelcomeEmail 15:10:30 yes NotApplicable
2 C17683 2010-10-03 WelcomeEmail 17:12:24 yes NotApplicable
3 C18647 2010-11-27 WelcomeEmail 16:59:02 yes NotApplicable
4 C36692 2009-12-04 WelcomeEmail 17:27:22 yes NotApplicable
In [25]:
email.columns
Out[25]:
Index(['CustomerID', 'DateOfemail', 'EmailType', 'timestamp', 'MailOpened',
       'MailClicked'],
      dtype='object')
In [26]:
# combine the 'DateOfemail' and 'timestamp' columns into a single datetime column
email['emailDatetime'] = pd.to_datetime(email['DateOfemail'] +' '+email['timestamp'])
In [27]:
email["EmailType"].value_counts()
Out[27]:
CustomizedEmail    493852
WelcomeEmail        23820
Name: EmailType, dtype: int64
In [28]:
email.columns
Out[28]:
Index(['CustomerID', 'DateOfemail', 'EmailType', 'timestamp', 'MailOpened',
       'MailClicked', 'emailDatetime'],
      dtype='object')

Let's try to visualize the different email categories.

b)visualizing emails sent year and month wise¶

In [29]:
#lets visualize month-wise
monthly_mails = email.groupby(pd.Grouper(key='emailDatetime', freq='M')).agg({'emailDatetime': 'count'})
# Create a subplot with 1 row and 2 columns
fig, axs = plt.subplots(1, 2, figsize=(14, 2))

# Plot monthly mail counts as a line chart
axs[0].plot(monthly_mails.index, monthly_mails['emailDatetime'], color='blue')
axs[0].set_title('Monthly mails-lineplot')
axs[0].set_xlabel('Date')
axs[0].set_ylabel('Mails')

# Plot monthly mail counts as a bar chart
axs[1].bar(monthly_mails.index.year, monthly_mails['emailDatetime'], color='green')
axs[1].set_title('Monthly mails-barplot')
axs[1].set_xlabel('Year')
axs[1].set_ylabel('Mails')

# Display the plot
plt.show()

If we observe the monthly mails, they were highest during 2013-2014.

In [30]:
#lets visualize yearwise 
yearly_mails = email.groupby(pd.Grouper(key='emailDatetime', freq='Y')).agg({'emailDatetime': 'count'})
# Create a subplot with 1 row and 2 columns
fig, axs = plt.subplots(1, 2, figsize=(14, 2))

# Plot yearly mail counts as a line chart
axs[0].plot(yearly_mails.index, yearly_mails['emailDatetime'], color='blue')
axs[0].set_title('yearly mails-lineplot')
axs[0].set_xlabel('Date')
axs[0].set_ylabel('Mails')

# Plot yearly mail counts as a bar chart
axs[1].bar(yearly_mails.index.year, yearly_mails['emailDatetime'], color='green')
axs[1].set_title('yearly mails-barplot')
axs[1].set_xlabel('Year')
axs[1].set_ylabel('Mails')

# Display the plot
plt.show()

If we observe carefully, monthly order quantities and monthly mails were both highest during the 2013-2014 period, which suggests that the number of mails sent rises and falls with order quantities.
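This claim can be checked numerically rather than by eye, by correlating monthly mail counts with monthly order totals. A minimal sketch on hypothetical monthly totals (with the real data, the `monthly_mails` and `monthly_orders` frames would first be aligned on their month index):

```python
import pandas as pd

# Hypothetical aligned monthly totals for mails sent and orders placed.
monthly = pd.DataFrame({
    "mails":  [100, 250, 400, 550],
    "orders": [ 90, 240, 390, 540],
})

# Pearson correlation: close to 1.0 means the two series move together.
corr = monthly["mails"].corr(monthly["orders"])
print(round(corr, 3))  # 1.0 for this perfectly linear toy data
```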

In [31]:
# group the data by customer ID and MailOpened, and count the number of times each customer opened or did not open a mail
mail_open_counts = email.groupby(['CustomerID', 'MailOpened'])['emailDatetime'].count()
# use the unstack method to pivot the table so that the rows represent customers and the columns represent MailOpened values
mail_open_counts = mail_open_counts.unstack(level=-1, fill_value=0)
# fill any missing values with 0
mail_open_counts = mail_open_counts.fillna(0)
mail_open_counts.columns=["Totalmails-notoponed","Totalmails-oponed"]
mail_open_counts=mail_open_counts.reset_index(drop=False)

We calculated, for each customer, the total number of emails opened and the total number not opened out of all emails received.
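The groupby/unstack pivot used above can be illustrated on toy open events (hypothetical data): counting per (customer, opened-flag) pair, then pivoting the flag values into columns.

```python
import pandas as pd

# Hypothetical open events: C1 opened twice and ignored once; C2 ignored once.
ev = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C1", "C2"],
    "MailOpened": ["yes", "yes", "no", "no"],
})

# Count events per (customer, opened-flag), then pivot flags into columns;
# fill_value=0 covers customers who never show a given flag.
counts = ev.groupby(["CustomerID", "MailOpened"]).size().unstack(fill_value=0)
print(counts.loc["C1", "yes"])  # 2
```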

In [32]:
#mail_open_counts
In [33]:
# group the data by customer ID and MailClicked, and count the number of times each customer clicked or did not click a mail
MailClicked_counts = email.groupby(['CustomerID', 'MailClicked'])['emailDatetime'].count()
# use the unstack method to pivot the table so that the rows represent customers and the columns represent MailClicked values
MailClicked_counts = MailClicked_counts.unstack(level=-1, fill_value=0)
# fill any missing values with 0
MailClicked_counts = MailClicked_counts.fillna(0)
MailClicked_counts.columns=["Notapplicableclickedmails","timesclicked-yes","timesclicked-no"]
MailClicked_counts=MailClicked_counts.reset_index(drop=False)

The total number of times a customer clicked on a mail and the total number of times they did not, across all mails received.

In [34]:
MailClicked_counts
Out[34]:
CustomerID Notapplicableclickedmails timesclicked-yes timesclicked-no
0 C12116 1 29 2
1 C12117 1 0 0
2 C12118 1 0 0
3 C12119 1 37 4
4 C12120 1 0 0
... ... ... ... ...
23815 C42185 1 43 0
23816 C42186 1 0 0
23817 C42187 1 0 0
23818 C42188 1 0 0
23819 C42189 1 50 0

23820 rows × 4 columns


There are two possibilities here:

  1. a customer might have received only the welcome email and no customized emails
  2. a customer might have received both the welcome email and customized emails.

We have to filter so that customers with only a welcome email are represented by that email, and the rest by their most recent customized email; in other words, we take each customer's most recent interaction to date.
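The welcome-only filter for case 1 can be sketched on toy data (the `em` frame here is hypothetical): keep welcome rows whose CustomerID never appears with a CustomizedEmail.

```python
import pandas as pd

# Hypothetical emails: C1 got welcome + customized; C2 got only a welcome email.
em = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2"],
    "EmailType": ["WelcomeEmail", "CustomizedEmail", "WelcomeEmail"],
})

# IDs that ever received a customized email, then an anti-join against them.
customized_ids = em.loc[em["EmailType"] == "CustomizedEmail", "CustomerID"].unique()
welcome_only = em[(em["EmailType"] == "WelcomeEmail")
                  & ~em["CustomerID"].isin(customized_ids)]
print(welcome_only["CustomerID"].tolist())  # ['C2']
```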
In [35]:
#filtering customers who have received welcome mail but not a customized mail
welcome_mail = email[(email['EmailType'] == 'WelcomeEmail') & ~(email['CustomerID'].isin(email[email['EmailType'] == 'CustomizedEmail']['CustomerID'].unique()))]
welcome_mail["CustomerID"].nunique()
Out[35]:
9709
In [36]:
welcome_mail.head()
Out[36]:
CustomerID DateOfemail EmailType timestamp MailOpened MailClicked emailDatetime
2 C17683 2010-10-03 WelcomeEmail 17:12:24 yes NotApplicable 2010-10-03 17:12:24
15 C29837 2011-02-22 WelcomeEmail 16:30:28 yes NotApplicable 2011-02-22 16:30:28
18 C13691 2011-03-10 WelcomeEmail 12:38:08 yes NotApplicable 2011-03-10 12:38:08
25 C14905 2009-04-24 WelcomeEmail 10:38:10 yes NotApplicable 2009-04-24 10:38:10
30 C22750 2011-01-04 WelcomeEmail 12:26:48 yes NotApplicable 2011-01-04 12:26:48
In [37]:
welcome_mail.shape
Out[37]:
(9709, 7)

Analyzing the average time between each email can provide businesses with some insights into customer engagement and help them identify opportunities to improve customer retention.

In [38]:
welcome_mail['DateOfemail']=pd.to_datetime(welcome_mail['DateOfemail'])
# Sort the dataset by customer ID and purchase date
welcome_mail = welcome_mail.sort_values(['CustomerID', 'DateOfemail'])

# Calculate the time between purchases for each customer
welcome_mail['time_between_emails'] = welcome_mail.groupby('CustomerID')['DateOfemail'].diff().dt.days.fillna(0)
In [39]:
# Group the data by customer ID and calculate the number of emails sent to customer,first email sent, and last email date
grouped_data3 = welcome_mail.groupby('CustomerID').agg({
    'CustomerID': 'count', # Number of emails
    'emailDatetime': ['max','min'], # Last email date,first email date
    'time_between_emails':'mean' 
})
# Rename the columns to be more descriptive
grouped_data3.columns = ['numofemails','lastemailDatetime','firstemailDatetime','average_time_between_mails']
grouped_data3 = grouped_data3.reset_index(drop=False)
In [40]:
grouped_data3['DateOfemail'] = grouped_data3['lastemailDatetime'].dt.date.astype(str)
grouped_data3['timestamp'] = grouped_data3['lastemailDatetime'].dt.time.apply(lambda x: x.strftime('%H:%M:%S'))
In [41]:
welcome_mail['DateOfemail']=welcome_mail['DateOfemail'].dt.date.astype(str)
In [42]:
# merge the grouped DataFrame back with the main DataFrame
merged_data3 = pd.merge(welcome_mail, grouped_data3, on=['CustomerID','DateOfemail','timestamp'],how="inner")
In [43]:
#merged_data.shape
#merged_data["CustomerID"].nunique()
#merged_data.columns
merged_data3=merged_data3.drop(['DateOfemail','timestamp'],axis=1)
In [44]:
#filtering persons who received customized mail
customizedmail=email[email['EmailType'] == 'CustomizedEmail']
In [45]:
customizedmail.head()
Out[45]:
CustomerID DateOfemail EmailType timestamp MailOpened MailClicked emailDatetime
23820 C16368 2013-05-20 CustomizedEmail 15:10:30 yes no 2013-05-20 15:10:30
23821 C16368 2013-02-05 CustomizedEmail 13:54:22 yes no 2013-02-05 13:54:22
23822 C16368 2012-11-26 CustomizedEmail 11:22:32 yes no 2012-11-26 11:22:32
23823 C16368 2013-05-30 CustomizedEmail 14:04:42 yes no 2013-05-30 14:04:42
23824 C16368 2013-05-18 CustomizedEmail 10:18:16 yes no 2013-05-18 10:18:16
In [46]:
customizedmail['DateOfemail']=pd.to_datetime(customizedmail['DateOfemail'])
# Sort the dataset by customer ID and purchase date
customizedmail = customizedmail.sort_values(['CustomerID', 'DateOfemail'])

# Calculate the time between purchases for each customer
customizedmail['time_between_emails'] = customizedmail.groupby('CustomerID')['DateOfemail'].diff().dt.days.fillna(0)
In [47]:
# Group the data by customer ID and calculate the number of emails sent to customer,first email sent, and last email date
grouped_data2 = customizedmail.groupby('CustomerID').agg({
    'CustomerID': 'count', # Number of emails
    'emailDatetime': ['max','min'], # Last email date,first email date
    'time_between_emails':'mean' 
})
In [48]:
grouped_data2
Out[48]:
CustomerID emailDatetime time_between_emails
count max min mean
CustomerID
C12116 31 2013-10-09 11:49:08 2013-09-24 11:48:56 0.483871
C12119 41 2013-10-21 10:24:24 2013-04-22 17:13:42 4.439024
C12122 45 2013-04-09 10:21:20 2012-11-13 10:29:56 3.266667
C12123 28 2014-01-10 13:59:24 2013-11-16 16:01:48 1.964286
C12124 32 2013-11-06 17:45:54 2013-10-31 13:22:32 0.187500
... ... ... ... ...
C42181 47 2013-10-29 11:53:36 2011-08-14 10:06:08 17.170213
C42182 49 2013-11-28 17:51:08 2012-08-31 10:32:00 9.265306
C42183 41 2017-08-13 17:43:26 2017-08-07 11:18:38 0.146341
C42185 43 2011-12-30 13:52:52 2011-03-05 11:01:04 6.976744
C42189 50 2013-11-05 16:32:30 2013-01-20 16:07:58 5.780000

14111 rows × 4 columns

In [49]:
# Rename the columns to be more descriptive
grouped_data2.columns = ['numofemails','lastemailDatetime','firstemailDatetime','average_time_between_mails']
grouped_data2 = grouped_data2.reset_index(drop=False)
In [50]:
grouped_data2.head()
Out[50]:
CustomerID numofemails lastemailDatetime firstemailDatetime average_time_between_mails
0 C12116 31 2013-10-09 11:49:08 2013-09-24 11:48:56 0.483871
1 C12119 41 2013-10-21 10:24:24 2013-04-22 17:13:42 4.439024
2 C12122 45 2013-04-09 10:21:20 2012-11-13 10:29:56 3.266667
3 C12123 28 2014-01-10 13:59:24 2013-11-16 16:01:48 1.964286
4 C12124 32 2013-11-06 17:45:54 2013-10-31 13:22:32 0.187500
In [51]:
grouped_data2['DateOfemail'] = grouped_data2['lastemailDatetime'].dt.date.astype(str)
grouped_data2['timestamp'] = grouped_data2['lastemailDatetime'].dt.time.apply(lambda x: x.strftime('%H:%M:%S'))
In [52]:
customizedmail['DateOfemail']=customizedmail['DateOfemail'].dt.date.astype(str)
In [53]:
# merge the grouped DataFrame back with the main DataFrame
merged_data = pd.merge(customizedmail, grouped_data2, on=['CustomerID','DateOfemail','timestamp'],how="inner")
In [54]:
merged_data
Out[54]:
CustomerID DateOfemail EmailType timestamp MailOpened MailClicked emailDatetime time_between_emails numofemails lastemailDatetime firstemailDatetime average_time_between_mails
0 C12116 2013-10-09 CustomizedEmail 11:49:08 no no 2013-10-09 11:49:08 0.0 31 2013-10-09 11:49:08 2013-09-24 11:48:56 0.483871
1 C12119 2013-10-21 CustomizedEmail 10:24:24 no no 2013-10-21 10:24:24 9.0 41 2013-10-21 10:24:24 2013-04-22 17:13:42 4.439024
2 C12122 2013-04-09 CustomizedEmail 10:21:20 no no 2013-04-09 10:21:20 6.0 45 2013-04-09 10:21:20 2012-11-13 10:29:56 3.266667
3 C12123 2014-01-10 CustomizedEmail 13:59:24 no no 2014-01-10 13:59:24 1.0 28 2014-01-10 13:59:24 2013-11-16 16:01:48 1.964286
4 C12124 2013-11-06 CustomizedEmail 17:45:54 no no 2013-11-06 17:45:54 0.0 32 2013-11-06 17:45:54 2013-10-31 13:22:32 0.187500
... ... ... ... ... ... ... ... ... ... ... ... ...
14106 C42181 2013-10-29 CustomizedEmail 11:53:36 yes no 2013-10-29 11:53:36 53.0 47 2013-10-29 11:53:36 2011-08-14 10:06:08 17.170213
14107 C42182 2013-11-28 CustomizedEmail 17:51:08 no no 2013-11-28 17:51:08 2.0 49 2013-11-28 17:51:08 2012-08-31 10:32:00 9.265306
14108 C42183 2017-08-13 CustomizedEmail 17:43:26 no no 2017-08-13 17:43:26 0.0 41 2017-08-13 17:43:26 2017-08-07 11:18:38 0.146341
14109 C42185 2011-12-30 CustomizedEmail 13:52:52 no no 2011-12-30 13:52:52 2.0 43 2011-12-30 13:52:52 2011-03-05 11:01:04 6.976744
14110 C42189 2013-11-05 CustomizedEmail 16:32:30 yes no 2013-11-05 16:32:30 1.0 50 2013-11-05 16:32:30 2013-01-20 16:07:58 5.780000

14111 rows × 12 columns

In [55]:
#merged_data.shape
#merged_data["CustomerID"].nunique()
#merged_data.columns
merged_data=merged_data.drop(['DateOfemail','timestamp'],axis=1)
In [56]:
merged_data.shape
Out[56]:
(14111, 10)
In [57]:
merged_data3.shape
Out[57]:
(9709, 10)
In [58]:
mail_data= pd.concat([merged_data3,merged_data])
mail_data.head()
Out[58]:
CustomerID EmailType MailOpened MailClicked emailDatetime time_between_emails numofemails lastemailDatetime firstemailDatetime average_time_between_mails
0 C12117 WelcomeEmail yes NotApplicable 2013-12-05 15:59:30 0.0 1 2013-12-05 15:59:30 2013-12-05 15:59:30 0.0
1 C12118 WelcomeEmail yes NotApplicable 2013-09-10 14:49:14 0.0 1 2013-09-10 14:49:14 2013-09-10 14:49:14 0.0
2 C12120 WelcomeEmail yes NotApplicable 2014-01-10 11:40:10 0.0 1 2014-01-10 11:40:10 2014-01-10 11:40:10 0.0
3 C12126 WelcomeEmail yes NotApplicable 2013-09-13 12:56:02 0.0 1 2013-09-13 12:56:02 2013-09-13 12:56:02 0.0
4 C12137 WelcomeEmail yes NotApplicable 2017-10-05 13:53:38 0.0 1 2017-10-05 13:53:38 2017-10-05 13:53:38 0.0
In [59]:
mail_data=mail_data.drop(["time_between_emails"],axis=1)

Now we have to merge all the files into a single dataframe.

In [60]:
train=customer.merge(train, on="CustomerID").merge(mail_data, on="CustomerID").merge(transaction_data, on="CustomerID").merge(mail_open_counts, on="CustomerID").merge(MailClicked_counts, on="CustomerID")
train.shape
Out[60]:
(23820, 42)

3) Data exploration and visualization-Exploratory data analysis¶

Preprocessing steps:-

1.Explaining problem statement

2.Data structure and content

3.Exploratory data analysis

4.Visual Analysis of data

5.Extracting insights

1)PROBLEM STATEMENT EXPLANATION:- Predicting the retail customer retention potential level as high/medium/low, plus customer segmentation and pattern extraction. Retail customer retention potential refers to the likelihood of customers returning to make additional purchases from a retail business. This potential can be categorized into three levels: high, medium, and low. High-potential customers are likely to make repeat purchases and remain loyal to the brand. Medium-potential customers may make additional purchases but are less likely to remain loyal. Low-potential customers are unlikely to make additional purchases or remain loyal to the brand.

2)DATA STRUCTURE AND CONTENT

In [61]:
#To view the first five rows
train.head()
Out[61]:
CustomerID City DateOfRegistration OnlineCommunication AutomaticRefill DoorstepDelivery PreferredDeliveryDay RetentionPotential EmailType MailOpened MailClicked emailDatetime numofemails lastemailDatetime firstemailDatetime average_time_between_mails OrderQuantity last_transaction_date num_transactions totalorders minimumorderquantity maximumorderquantity first_transaction_date average_time_between_purchase averageorders Totalorders-jan Totalorders-feb Totalorders-march Totalorders-april Totalorders-may Totalorders-june Totalorders-july Totalorders-august Totalorders-september Totalorders-october Totalorders-november Totalorders-december Totalmails-notoponed Totalmails-oponed Notapplicableclickedmails timesclicked-yes timesclicked-no
0 C16368 CITY2 2012-09-28 0 0 0 Monday High CustomizedEmail yes no 2013-07-25 12:46:04 29 2013-07-25 12:46:04 2012-10-07 17:29:40 10.034483 15 2013-08-11 15:17:52 1 15 15 15 2013-08-11 15:17:52 0.000000 15 0 0 0 0 0 0 0 15 0 0 0 0 0 30 1 28 1
1 C20345 CITY2 2010-12-19 1 1 1 Friday Low CustomizedEmail yes no 2013-12-19 14:28:06 95 2013-12-19 14:28:06 2010-12-21 10:43:32 11.515789 84 2014-01-19 18:11:30 186 15614 61 108 2011-04-01 19:50:00 5.505376 83 671 1145 696 1303 1516 1523 1679 1675 1314 1140 1321 1631 7 89 1 85 10
2 C17683 CITY2 2010-10-03 0 0 0 Wednesday Medium WelcomeEmail yes NotApplicable 2010-10-03 17:12:24 1 2010-10-03 17:12:24 2010-10-03 17:12:24 0.000000 34 2011-07-06 18:12:10 13 443 27 51 2010-12-01 11:38:20 16.692308 34 68 0 64 70 64 80 34 0 0 0 0 63 0 1 1 0 0
3 C18647 CITY4 2010-11-27 0 0 0 Monday Low CustomizedEmail yes no 2012-12-17 11:00:04 30 2012-12-17 11:00:04 2011-02-10 16:19:20 22.533333 111 2013-01-28 17:04:46 7 748 92 117 2010-11-29 15:23:52 113.000000 106 111 114 0 0 216 0 0 105 0 0 110 92 3 28 1 26 4
4 C36692 CITY2 2009-12-04 1 1 0 Friday Low CustomizedEmail no no 2014-01-16 13:34:52 64 2014-01-16 13:34:52 2009-12-24 10:59:42 23.187500 69 2014-01-16 11:21:46 34 2350 55 83 2011-09-21 19:54:50 24.941176 69 69 148 211 279 131 249 137 214 294 131 291 196 46 19 1 54 10
In [62]:
#to know how many rows and columns are present
train.shape
Out[62]:
(23820, 42)
In [63]:
#understanding datatypes of each column
train.dtypes
Out[63]:
CustomerID                               object
City                                     object
DateOfRegistration               datetime64[ns]
OnlineCommunication                       int64
AutomaticRefill                           int64
DoorstepDelivery                          int64
PreferredDeliveryDay                     object
RetentionPotential                       object
EmailType                                object
MailOpened                               object
MailClicked                              object
emailDatetime                    datetime64[ns]
numofemails                               int64
lastemailDatetime                datetime64[ns]
firstemailDatetime               datetime64[ns]
average_time_between_mails              float64
OrderQuantity                             int64
last_transaction_date            datetime64[ns]
num_transactions                          int64
totalorders                               int64
minimumorderquantity                      int64
maximumorderquantity                      int64
first_transaction_date           datetime64[ns]
average_time_between_purchase           float64
averageorders                             int64
Totalorders-jan                           int64
Totalorders-feb                           int64
Totalorders-march                         int64
Totalorders-april                         int64
Totalorders-may                           int64
Totalorders-june                          int64
Totalorders-july                          int64
Totalorders-august                        int64
Totalorders-september                     int64
Totalorders-october                       int64
Totalorders-november                      int64
Totalorders-december                      int64
Totalmails-notoponed                      int64
Totalmails-oponed                         int64
Notapplicableclickedmails                 int64
timesclicked-yes                          int64
timesclicked-no                           int64
dtype: object
In [64]:
#checking the number of unique values in each column so we can spot any incorrect dtypes
for i in train.columns:
  print("no of unique values and dtype:","|",i,"|",train[i].nunique(),"|",train[i].dtype)
no of unique values and dtype: | CustomerID | 23820 | object
no of unique values and dtype: | City | 4 | object
no of unique values and dtype: | DateOfRegistration | 2673 | datetime64[ns]
no of unique values and dtype: | OnlineCommunication | 2 | int64
no of unique values and dtype: | AutomaticRefill | 2 | int64
no of unique values and dtype: | DoorstepDelivery | 2 | int64
no of unique values and dtype: | PreferredDeliveryDay | 7 | object
no of unique values and dtype: | RetentionPotential | 3 | object
no of unique values and dtype: | EmailType | 2 | object
no of unique values and dtype: | MailOpened | 2 | object
no of unique values and dtype: | MailClicked | 3 | object
no of unique values and dtype: | emailDatetime | 23771 | datetime64[ns]
no of unique values and dtype: | numofemails | 83 | int64
no of unique values and dtype: | lastemailDatetime | 23771 | datetime64[ns]
no of unique values and dtype: | firstemailDatetime | 23789 | datetime64[ns]
no of unique values and dtype: | average_time_between_mails | 5669 | float64
no of unique values and dtype: | OrderQuantity | 310 | int64
no of unique values and dtype: | last_transaction_date | 23792 | datetime64[ns]
no of unique values and dtype: | num_transactions | 150 | int64
no of unique values and dtype: | totalorders | 2206 | int64
no of unique values and dtype: | minimumorderquantity | 304 | int64
no of unique values and dtype: | maximumorderquantity | 325 | int64
no of unique values and dtype: | first_transaction_date | 23798 | datetime64[ns]
no of unique values and dtype: | average_time_between_purchase | 3069 | float64
no of unique values and dtype: | averageorders | 307 | int64
no of unique values and dtype: | Totalorders-jan | 693 | int64
no of unique values and dtype: | Totalorders-feb | 595 | int64
no of unique values and dtype: | Totalorders-march | 621 | int64
no of unique values and dtype: | Totalorders-april | 628 | int64
no of unique values and dtype: | Totalorders-may | 678 | int64
no of unique values and dtype: | Totalorders-june | 659 | int64
no of unique values and dtype: | Totalorders-july | 693 | int64
no of unique values and dtype: | Totalorders-august | 668 | int64
no of unique values and dtype: | Totalorders-september | 697 | int64
no of unique values and dtype: | Totalorders-october | 728 | int64
no of unique values and dtype: | Totalorders-november | 726 | int64
no of unique values and dtype: | Totalorders-december | 749 | int64
no of unique values and dtype: | Totalmails-notoponed | 70 | int64
no of unique values and dtype: | Totalmails-oponed | 61 | int64
no of unique values and dtype: | Notapplicableclickedmails | 1 | int64
no of unique values and dtype: | timesclicked-yes | 80 | int64
no of unique values and dtype: | timesclicked-no | 30 | int64
In [65]:
fig, ax = plt.subplots(figsize=(10, 3))
# Iterating over the columns
for col in train.columns:
    # Calculate the number of unique values in the column
    num_unique = len(train[col].unique())
    # Plot a bar for each column
    ax.bar(col, num_unique)
# Customize the chart
ax.set_xlabel('Column Name')
ax.set_ylabel('Number of Unique Values')
ax.set_title('Unique Values in Each Column')
plt.xticks(rotation=90)

# Show the chart
plt.show()

If we observe here, OnlineCommunication, AutomaticRefill and DoorstepDelivery have incorrect dtypes (they are binary flags stored as int64), so we have to convert these datatypes.

In [66]:
#converting dtypes of the following columns
train["OnlineCommunication"]=train["OnlineCommunication"].astype("object")
train["DoorstepDelivery"]=train["DoorstepDelivery"].astype("object")
train["AutomaticRefill"]=train["AutomaticRefill"].astype("object")
In [67]:
# Basic Info of the Data Frame
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 23820 entries, 0 to 23819
Data columns (total 42 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   CustomerID                     23820 non-null  object        
 1   City                           23820 non-null  object        
 2   DateOfRegistration             23820 non-null  datetime64[ns]
 3   OnlineCommunication            23820 non-null  object        
 4   AutomaticRefill                23820 non-null  object        
 5   DoorstepDelivery               23820 non-null  object        
 6   PreferredDeliveryDay           23820 non-null  object        
 7   RetentionPotential             23820 non-null  object        
 8   EmailType                      23820 non-null  object        
 9   MailOpened                     23820 non-null  object        
 10  MailClicked                    23820 non-null  object        
 11  emailDatetime                  23820 non-null  datetime64[ns]
 12  numofemails                    23820 non-null  int64         
 13  lastemailDatetime              23820 non-null  datetime64[ns]
 14  firstemailDatetime             23820 non-null  datetime64[ns]
 15  average_time_between_mails     23820 non-null  float64       
 16  OrderQuantity                  23820 non-null  int64         
 17  last_transaction_date          23820 non-null  datetime64[ns]
 18  num_transactions               23820 non-null  int64         
 19  totalorders                    23820 non-null  int64         
 20  minimumorderquantity           23820 non-null  int64         
 21  maximumorderquantity           23820 non-null  int64         
 22  first_transaction_date         23820 non-null  datetime64[ns]
 23  average_time_between_purchase  23820 non-null  float64       
 24  averageorders                  23820 non-null  int64         
 25  Totalorders-jan                23820 non-null  int64         
 26  Totalorders-feb                23820 non-null  int64         
 27  Totalorders-march              23820 non-null  int64         
 28  Totalorders-april              23820 non-null  int64         
 29  Totalorders-may                23820 non-null  int64         
 30  Totalorders-june               23820 non-null  int64         
 31  Totalorders-july               23820 non-null  int64         
 32  Totalorders-august             23820 non-null  int64         
 33  Totalorders-september          23820 non-null  int64         
 34  Totalorders-october            23820 non-null  int64         
 35  Totalorders-november           23820 non-null  int64         
 36  Totalorders-december           23820 non-null  int64         
 37  Totalmails-notoponed           23820 non-null  int64         
 38  Totalmails-oponed              23820 non-null  int64         
 39  Notapplicableclickedmails      23820 non-null  int64         
 40  timesclicked-yes               23820 non-null  int64         
 41  timesclicked-no                23820 non-null  int64         
dtypes: datetime64[ns](6), float64(2), int64(24), object(10)
memory usage: 7.8+ MB

If we observe here, we don't have a null-value problem, so no imputation is required as of now.

We can derive new columns from the existing ones that might be useful for our analysis:

1) Difference between last_transaction_date and first_transaction_date, to know how long a customer has been purchasing on the platform.

2) Difference between the first and last email, i.e. email_duration: the time period between the first and last email of a particular campaign or sequence. For example, if the email_duration is too short, it may suggest that you are not giving your audience enough time to engage with your content. Conversely, if the cadence is too long, it may indicate that your emails are not engaging enough or that you are not providing enough value to your subscribers.

3) Difference between the date of registration and the last transaction date, which tells us how old or new a customer is. By subtracting the registration date from the last transaction date, you can determine the total duration of the customer's relationship with the company.

In [68]:
train["days_since_first_transaction"]=(train['last_transaction_date'] - train['first_transaction_date']).dt.days
train["email_duration"]=(train['lastemailDatetime']- train['firstemailDatetime']).dt.days
train["duration_of_customer_relation"]=(train['last_transaction_date'] - train['DateOfRegistration']).dt.days

4) The time elapsed since the customer's last transaction (recency). Here are some reasons why this metric is important:

- Identify at-risk customers: if a customer has not made a purchase in a long time, they may be at risk of churn. By tracking the time elapsed since their last transaction, you can identify at-risk customers and take steps to retain them.
- Measure customer engagement: customers who make frequent purchases are generally more engaged with your products or services; recency gives a sense of how engaged customers are.
- Develop targeted marketing campaigns: customers who have not made a purchase in a while may be more likely to respond to targeted campaigns; segmenting by recency lets you tailor campaigns to their specific needs and interests.

Overall, recency is a valuable metric for understanding customer behavior and developing effective marketing and retention strategies.

In [69]:
train["recency"]=(train["last_transaction_date"].max()-train["last_transaction_date"]).dt.days

We will drop a few columns such as first_transaction_date and firstemailDatetime, as we kept them only to compute the differences from the latest dates; now we will drop these columns.

In [70]:
#train=train.drop(["first_transaction_date","firstemailDatetime","emailDatetime"],axis=1)

1) If the first transaction is also the last transaction, the difference comes out as 0, which is misleading in the context of our data. In such cases we should either exclude the zero rows or bin the column into categories. 2) The same holds for email_duration: if the first email is also the last email, the difference is 0. So we will bin these two columns.

In [71]:
bins = pd.cut(train["days_since_first_transaction"], [-1, 1, 10, 100, train["days_since_first_transaction"].max()], labels=["Only transaction", "Recent Transactions", "Moderate History", "Long-term History"])
train["days_since_first_transaction_category"] = bins
train["days_since_first_transaction_category"].value_counts()
Out[71]:
Only transaction       15013
Long-term History       4111
Moderate History        3989
Recent Transactions      707
Name: days_since_first_transaction_category, dtype: int64
In [72]:
bins = pd.cut(train["email_duration"], [-1, 1, 10, 100, train["email_duration"].max()], labels=["Only email", "Medium history", "long mail history", "very Long mail History"])
train["email_duration_category"] = bins
train["email_duration_category"].value_counts()
Out[72]:
Only email                10954
long mail history          6026
very Long mail History     4960
Medium history             1880
Name: email_duration_category, dtype: int64
In [73]:
train["email_duration_category"]=train["email_duration_category"].astype("object")
train["days_since_first_transaction_category"]=train["days_since_first_transaction_category"].astype("object")
In [74]:
#as we have already binned these columns, we will drop them
#train=train.drop(["days_since_first_transaction","email_duration"],axis=1)
In [75]:
#understanding statistical info of our data
train.describe()
Out[75]:
numofemails average_time_between_mails OrderQuantity num_transactions totalorders minimumorderquantity maximumorderquantity average_time_between_purchase averageorders Totalorders-jan Totalorders-feb Totalorders-march Totalorders-april Totalorders-may Totalorders-june Totalorders-july Totalorders-august Totalorders-september Totalorders-october Totalorders-november Totalorders-december Totalmails-notoponed Totalmails-oponed Notapplicableclickedmails timesclicked-yes timesclicked-no days_since_first_transaction email_duration duration_of_customer_relation recency
count 23820.000000 23820.000000 23820.000000 23820.000000 23820.000000 23820.000000 23820.000000 23820.000000 23820.000000 23820.000000 23820.000000 23820.000000 23820.000000 23820.00000 23820.000000 23820.000000 23820.000000 23820.000000 23820.000000 23820.000000 23820.000000 23820.000000 23820.000000 23820.0 23820.000000 23820.000000 23820.000000 23820.000000 23820.000000 23820.000000
mean 21.140260 3.318651 62.107976 4.397775 357.622922 58.586608 65.653778 12.527876 61.931444 32.711545 19.269270 21.799874 22.378757 26.32351 27.053904 27.685516 28.474349 35.221662 39.969941 40.028128 36.706465 15.525273 6.207389 1.0 19.563854 1.168808 86.753568 105.719563 129.325441 1426.424223
std 19.620475 12.121427 38.729691 12.142475 1335.649680 36.428571 41.270385 36.102332 38.318825 128.808814 103.623359 114.836843 113.052766 125.30813 127.011724 129.660528 130.733997 132.849129 138.678043 134.577762 138.087054 17.239732 8.517405 0.0 19.168208 2.376062 220.450104 241.409428 266.695842 675.530650
min 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1.000000 0.000000 41.000000 1.000000 41.000000 40.000000 41.000000 0.000000 41.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.0 0.000000 0.000000 0.000000 0.000000 0.000000 1469.000000
50% 19.000000 0.210526 51.000000 1.000000 65.000000 48.000000 53.000000 0.000000 51.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 7.000000 2.000000 1.0 17.000000 0.000000 0.000000 6.000000 16.000000 1543.000000
75% 40.000000 2.116279 75.000000 3.000000 169.000000 69.000000 80.000000 10.000000 74.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 10.250000 41.000000 40.000000 0.000000 32.000000 8.000000 1.0 38.000000 1.000000 41.000000 69.000000 102.000000 1663.000000
max 291.000000 561.000000 1501.000000 257.000000 51429.000000 1501.000000 1501.000000 807.500000 1501.000000 3788.000000 3198.000000 3670.000000 3679.000000 6114.00000 4939.000000 4035.000000 5179.000000 3358.000000 4845.000000 5461.000000 4408.000000 91.000000 256.000000 1.0 249.000000 42.000000 1984.000000 1963.000000 1998.000000 3463.000000
In [76]:
#train.dtypes

If we observe, we have extreme values present in many of the columns, so we will decide whether to replace or transform them on the basis of the boxplots we will get.
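One quick way to see the skew without a plot is to compare the mean and the median from the describe() output above. This is a hedged sketch: the numbers are copied by hand from the printed summary for totalorders, not recomputed from train.

```python
import pandas as pd

# Summary stats copied from the train.describe() output above.
stats = pd.Series({"mean": 357.622922, "median": 65.0, "max": 51429.0},
                  name="totalorders")

# A mean far above the median is a classic sign of right skew
# driven by a few extreme values.
print(round(stats["mean"] / stats["median"], 1))  # mean is ~5.5x the median
print(round(stats["max"] / stats["mean"], 1))     # max dwarfs even the mean
```

The same check applied to the other columns in the table points the same way, which is why boxplots are the next step.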

In [77]:
#separating numerical, categorical and datetime columns
numericalcolumns=[]
categoricalcolumns=[]
datetypecolumns=[]
for i in train.columns:
  if train[i].dtypes=="object":
    categoricalcolumns.append(i)
  elif train[i].dtypes=="datetime64[ns]": 
    datetypecolumns.append(i) 
  else:
    numericalcolumns.append(i)   
In [78]:
print(numericalcolumns)
print(categoricalcolumns)
print(datetypecolumns)
['numofemails', 'average_time_between_mails', 'OrderQuantity', 'num_transactions', 'totalorders', 'minimumorderquantity', 'maximumorderquantity', 'average_time_between_purchase', 'averageorders', 'Totalorders-jan', 'Totalorders-feb', 'Totalorders-march', 'Totalorders-april', 'Totalorders-may', 'Totalorders-june', 'Totalorders-july', 'Totalorders-august', 'Totalorders-september', 'Totalorders-october', 'Totalorders-november', 'Totalorders-december', 'Totalmails-notoponed', 'Totalmails-oponed', 'Notapplicableclickedmails', 'timesclicked-yes', 'timesclicked-no', 'days_since_first_transaction', 'email_duration', 'duration_of_customer_relation', 'recency']
['CustomerID', 'City', 'OnlineCommunication', 'AutomaticRefill', 'DoorstepDelivery', 'PreferredDeliveryDay', 'RetentionPotential', 'EmailType', 'MailOpened', 'MailClicked', 'days_since_first_transaction_category', 'email_duration_category']
['DateOfRegistration', 'emailDatetime', 'lastemailDatetime', 'firstemailDatetime', 'last_transaction_date', 'first_transaction_date']

3. Data Quality Check and Missing Values

3.1 Percentage of missing values for columns

In [79]:
# checking column-wise null percentages
cols = pd.DataFrame(train.isnull().mean().round(4) * 100, columns =['percentage_missing_value']).sort_values(by=['percentage_missing_value'])
print(cols)
                                       percentage_missing_value
CustomerID                                                  0.0
Totalorders-feb                                             0.0
Totalorders-march                                           0.0
Totalorders-april                                           0.0
Totalorders-may                                             0.0
Totalorders-june                                            0.0
Totalorders-july                                            0.0
Totalorders-august                                          0.0
Totalorders-september                                       0.0
Totalorders-october                                         0.0
Totalorders-jan                                             0.0
Totalorders-november                                        0.0
Totalmails-notoponed                                        0.0
Totalmails-oponed                                           0.0
Notapplicableclickedmails                                   0.0
timesclicked-yes                                            0.0
timesclicked-no                                             0.0
days_since_first_transaction                                0.0
email_duration                                              0.0
duration_of_customer_relation                               0.0
recency                                                     0.0
Totalorders-december                                        0.0
averageorders                                               0.0
average_time_between_purchase                               0.0
first_transaction_date                                      0.0
City                                                        0.0
DateOfRegistration                                          0.0
OnlineCommunication                                         0.0
AutomaticRefill                                             0.0
DoorstepDelivery                                            0.0
PreferredDeliveryDay                                        0.0
RetentionPotential                                          0.0
EmailType                                                   0.0
MailOpened                                                  0.0
MailClicked                                                 0.0
emailDatetime                                               0.0
numofemails                                                 0.0
lastemailDatetime                                           0.0
firstemailDatetime                                          0.0
average_time_between_mails                                  0.0
OrderQuantity                                               0.0
last_transaction_date                                       0.0
num_transactions                                            0.0
totalorders                                                 0.0
minimumorderquantity                                        0.0
maximumorderquantity                                        0.0
days_since_first_transaction_category                       0.0
email_duration_category                                     0.0

Though there are no missing values in columns such as average_time_between_mails and average_time_between_purchase, those columns cannot legitimately contain 0s, so we will either replace the 0s with the mean or median (after checking the distributions with a dist plot) or bin these columns into a few labels.
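Before imputing, it may help to measure how widespread these placeholder zeros are. A minimal sketch, where a toy Series stands in for a train column such as average_time_between_mails:

```python
import pandas as pd

# Toy stand-in for a column where 0 really means "unknown".
col = pd.Series([0.0, 2.1, 0.0, 3.3, 0.21, 0.0])

# Share of rows carrying the placeholder zero.
zero_pct = (col == 0).mean() * 100
print(round(zero_pct, 1))  # 50.0
```

A high zero share is a warning that mean imputation would be badly distorted, which supports the median choice made below.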

In [80]:
#replacing the 0's with either mean or median by checking their distributions
sns.histplot(train["average_time_between_mails"], kde=True, stat="density")  # sns.distplot is deprecated
#the distribution is not normal; it is right-skewed, so we will impute with the median, as the median is not affected by outliers
Out[80]:
<Axes: xlabel='average_time_between_mails', ylabel='Density'>
In [81]:
#replacing the 0's with either mean or median by checking their distributions
sns.histplot(train["average_time_between_purchase"], kde=True, stat="density")  # sns.distplot is deprecated
#the distribution is not normal; it is right-skewed, so we will impute with the median, as the median is not affected by outliers
Out[81]:
<Axes: xlabel='average_time_between_purchase', ylabel='Density'>
In [82]:
#train.columns
In [83]:
#1) for column average_time_between_mails
import numpy as np  # pd.np is deprecated (removed in pandas 2.0); use numpy directly
# replace zeros with NaN
train["average_time_between_mails"] = train["average_time_between_mails"].replace(0, np.nan)
# replace NaN values with the median of the column
median = train["average_time_between_mails"].median()
train["average_time_between_mails"] = train["average_time_between_mails"].fillna(median)
#2) for column average_time_between_purchase
# replace zeros with NaN
train["average_time_between_purchase"] = train["average_time_between_purchase"].replace(0, np.nan)
# replace NaN values with the median of the column
median = train["average_time_between_purchase"].median()
train["average_time_between_purchase"] = train["average_time_between_purchase"].fillna(median)
In [84]:
# Check total duplicated values
train.duplicated().sum()
Out[84]:
0

a) Visualizing the total orders by month

In [366]:
total_orders = {}
for month in order_counts.columns:
    if month.startswith('Totalorders-'):
        total_orders[month] = order_counts[month].sum()
total_orders
# Create a list of months and a list of total orders
months = list(total_orders.keys())
orders = list(total_orders.values())
# Create a bar plot
plt.figure(figsize=(5,5))
plt.bar(months, orders)
plt.xticks(rotation=90)
# Set the title and axis labels
plt.title('Total Orders by Month')
plt.xlabel('Month')
plt.ylabel('Total Orders')
# Show the plot
plt.show()

Orders were above 200,000 in January, September, October, November and December, and below 200,000 in the other months.

In [85]:
train["Totalorderquantity_in_first_quarter"]=train["Totalorders-jan"]+train["Totalorders-feb"]+train["Totalorders-march"]+train["Totalorders-april"]
train["Totalorderquantity_in_second_quarter"]=train["Totalorders-may"]+train["Totalorders-june"]+train["Totalorders-july"]+train["Totalorders-august"]
train["Totalorderquantity_in_Third_quarter"]=train["Totalorders-september"]+train["Totalorders-october"]+train["Totalorders-november"]+train["Totalorders-december"]
In [86]:
train=train.drop(['Totalorders-jan', 'Totalorders-feb',
       'Totalorders-march', 'Totalorders-april', 'Totalorders-may',
       'Totalorders-june', 'Totalorders-july', 'Totalorders-august',
       'Totalorders-september', 'Totalorders-october', 'Totalorders-november',
       'Totalorders-december'],axis=1)

checking categorical columns

In [87]:
train["City"].value_counts()
Out[87]:
CITY4    9143
CITY2    6879
CITY3    6490
CITY1    1308
Name: City, dtype: int64
In [88]:
train["OnlineCommunication"].value_counts()
# map OnlineCommunication to new categories
new_categories = {
    0: 'no',
    1: 'yes',
}
train['OnlineCommunication'] = train['OnlineCommunication'].map(new_categories)

# print the resulting dataframe
print(train['OnlineCommunication'].value_counts())
yes    15731
no      8089
Name: OnlineCommunication, dtype: int64
In [89]:
#train["AutomaticRefill"].value_counts()
# map AutomaticRefill to new categories
new_categories = {
    0: 'no',
    1: 'yes',
}
train['AutomaticRefill'] = train['AutomaticRefill'].map(new_categories)

# print the resulting dataframe
print(train['AutomaticRefill'].value_counts())
no     21564
yes     2256
Name: AutomaticRefill, dtype: int64
In [90]:
#train["DoorstepDelivery"].value_counts()
# map DoorstepDelivery to new categories
new_categories = {
    0: 'no',
    1: 'yes',
}
train['DoorstepDelivery'] = train['DoorstepDelivery'].map(new_categories)

# print the resulting dataframe
print(train['DoorstepDelivery'].value_counts())
no     22924
yes      896
Name: DoorstepDelivery, dtype: int64
In [91]:
train["PreferredDeliveryDay"].value_counts()
# map PreferredDeliveryDay to new categories
new_categories = {
    'Monday': 'Monday',
    'Tuesday':'Tuesday',
    'Wednesday':'Wednesday',
    'Thursday':'Thursday',
    'Friday':'Friday',
    'Saturday':'weekend',
    'Sunday':'weekend'

}
train['PreferredDeliveryDay'] = train['PreferredDeliveryDay'].map(new_categories)

# print the resulting dataframe
print(train['PreferredDeliveryDay'].value_counts())
Tuesday      5343
Monday       5300
Friday       4167
Thursday     4043
Wednesday    3778
weekend      1189
Name: PreferredDeliveryDay, dtype: int64
In [92]:
train["RetentionPotential"].value_counts()
Out[92]:
Low       19185
High       3756
Medium      879
Name: RetentionPotential, dtype: int64
In [93]:
train["EmailType"].value_counts()
Out[93]:
CustomizedEmail    14111
WelcomeEmail        9709
Name: EmailType, dtype: int64
In [94]:
train["MailOpened"].value_counts()
Out[94]:
yes    13749
no     10071
Name: MailOpened, dtype: int64
In [95]:
train["MailClicked"].value_counts()
Out[95]:
no               13191
NotApplicable     9709
yes                920
Name: MailClicked, dtype: int64
In [96]:
train["Notapplicableclickedmails"].value_counts() #this column has only one unique value, so we can remove it
train=train.drop(["Notapplicableclickedmails"],axis=1)
In [97]:
for i in categoricalcolumns:
  train[i]=train[i].astype("category")
In [98]:
train[categoricalcolumns].dtypes
Out[98]:
CustomerID                               category
City                                     category
OnlineCommunication                      category
AutomaticRefill                          category
DoorstepDelivery                         category
PreferredDeliveryDay                     category
RetentionPotential                       category
EmailType                                category
MailOpened                               category
MailClicked                              category
days_since_first_transaction_category    category
email_duration_category                  category
dtype: object
In [99]:
#separating numerical, categorical and datetime columns
numericalcolumns=[]
categoricalcolumns=[]
datetypecolumns=[]
for i in train.columns:
  if train[i].dtypes=="category":
    categoricalcolumns.append(i)
  elif train[i].dtypes=="datetime64[ns]": 
    datetypecolumns.append(i) 
  else:
    numericalcolumns.append(i)
In [100]:
numericalcolumns
Out[100]:
['numofemails',
 'average_time_between_mails',
 'OrderQuantity',
 'num_transactions',
 'totalorders',
 'minimumorderquantity',
 'maximumorderquantity',
 'average_time_between_purchase',
 'averageorders',
 'Totalmails-notoponed',
 'Totalmails-oponed',
 'timesclicked-yes',
 'timesclicked-no',
 'days_since_first_transaction',
 'email_duration',
 'duration_of_customer_relation',
 'recency',
 'Totalorderquantity_in_first_quarter',
 'Totalorderquantity_in_second_quarter',
 'Totalorderquantity_in_Third_quarter']

Outlier analysis for numerical columns

In [212]:
numcols=["OrderQuantity",'numofemails','num_transactions','days_since_first_transaction','recency']
In [213]:
for i in numcols:
  plt.figure(figsize=(6,6))
  sns.boxplot(train[i])
  plt.title("Boxplot of " + i)
  
  # calculate quartiles and upper/lower bounds
  q1 = train[i].quantile(0.25)
  q2 = train[i].quantile(0.5)
  q3 = train[i].quantile(0.75)
  iqr = q3 - q1
  upper = q3 + 1.5*iqr
  lower = q1 - 1.5*iqr
  
  # mark quartiles and upper/lower bounds on the plot
  plt.axhline(q1, color='r', linestyle='--', label='Q1')
  plt.axhline(q2, color='g', linestyle='-', label='Q2 (Median)')
  plt.axhline(q3, color='b', linestyle='--', label='Q3')
  plt.axhline(upper, color='gray', linestyle='--', label='Upper Bound')
  plt.axhline(lower, color='gray', linestyle='--', label='Lower Bound')
  
  plt.legend()
  plt.show()
In [214]:
#percentage of outliers present in each column
for i in numcols:
    
 IQR = (train[i]).quantile(0.75) - (train[i]).quantile(0.25)
 Upper_limit = (train[i]).quantile(0.75) + IQR*1.5
 lower_limit = (train[i]).quantile(0.25) - IQR*1.5
 print("|",i,"|",round(100.0 * len(train[(train[i])>Upper_limit])/len(train), 2)+round(100.0 * len(train[(train[i])<lower_limit])/len(train), 2))
| OrderQuantity | 5.8
| numofemails | 0.02
| num_transactions | 12.09
| days_since_first_transaction | 17.11
| recency | 34.28

Here we are not going to replace or remove the outliers: they are not errors occurring at random, and these extreme values may carry significant information that is useful for our model building.

3.a) Visualizations without target variable

1) Analysis of customers city-wise

In [103]:
grouped_data = train.groupby("City").size().reset_index(name="Number of Customers")
import matplotlib.pyplot as plt
# define colors for each bar
colors = ['#fc4f30', '#e5ae37', '#6d904f', '#a27ea8']
# plot bar chart
grouped_data.plot(kind='bar', x='City', y='Number of Customers',color=colors)
# set chart title and axis labels
plt.title('Number of Customers per City')
plt.xlabel('City')
plt.ylabel('Number of Customers')
# show plot
plt.show()

For CITY1, the number of customers is far lower than for CITY2, CITY3 and CITY4. From this we can draw a few conclusions, which might or might not hold:

1) The more customers a city has, the more focus it should receive, as there is higher potential for customer retention; further analysis might reveal the status of retention potential in those cities.

2) The fewer customers a city has, the more marketing and schemes should be developed there so that more customers are introduced.

In [104]:
# group the data by city and retention potenial, and count the number of customers for each city
counts = train.groupby(['City','RetentionPotential']).size().reset_index(name="Number of Customers")
counts
Out[104]:
City RetentionPotential Number of Customers
0 CITY1 High 125
1 CITY1 Low 1155
2 CITY1 Medium 28
3 CITY2 High 1257
4 CITY2 Low 5168
5 CITY2 Medium 454
6 CITY3 High 866
7 CITY3 Low 5503
8 CITY3 Medium 121
9 CITY4 High 1508
10 CITY4 Low 7359
11 CITY4 Medium 276

1) If we observe here, CITY2 has the highest number of customers with medium retention potential, so we should not lose those customers, as they are valuable for our analysis.

2) CITY4 has 7359 customers with low retention potential. Their risk of churning is much higher than in the other cities, so a reassessment should take place so that churn does not happen.

In [105]:
# Group the data by city and calculate the total orders by that city customers,number of transactions did by that city customers, and comparing number of transactions each quarter
data2 = train.groupby('City').agg({'CustomerID':'count','totalorders':['sum','mean'],'num_transactions':'sum','Totalorderquantity_in_first_quarter':'sum','Totalorderquantity_in_second_quarter':'sum','Totalorderquantity_in_Third_quarter':'sum'})
data2.head()
Out[105]:
CustomerID totalorders num_transactions Totalorderquantity_in_first_quarter Totalorderquantity_in_second_quarter Totalorderquantity_in_Third_quarter
count sum mean sum sum sum sum
City
CITY1 1308 261920 200.244648 3632 52923 70081 138916
CITY2 6879 4380420 636.781509 48860 1296802 1439515 1644103
CITY3 6490 1098124 169.202465 15700 234813 222573 640738
CITY4 9143 2778114 303.851471 36563 705980 877009 1195125

1)From the groupby operation we had done above we can find that city 2 is having 6879 customers but they are having large ammount of customers who had ordered so much with total orders and num of transactions summing upto 4380420 orders in 48860 transactions,with order quantity in each quarter growing up.Also we have a clear evidence from the plot above this that there are nearly 1700 customers who has retention potenial of high and medium.So customers from city 2 would add great value to our analysis.

2)Similarly for CITY4: the previous plot showed that the risk of churning is higher there than in other cities, but the number of customers with medium and high retention potential is also larger. Its number of orders and transactions is second only to CITY2, and it has been improving quarter over quarter.

3)The average order value in CITY1 is higher than in CITY3, even though CITY1 has fewer customers than the other cities, and CITY3 has a lower order quantity during the second quarter of the year.

In [106]:
#checking number of customers registered over years
plt.figure(figsize=(5,5))
train['DateOfRegistration'] = pd.to_datetime(train['DateOfRegistration'])
train['RegistrationYear'] = train['DateOfRegistration'].dt.year
yearly_registrations = train.groupby('RegistrationYear').count()['CustomerID']
plt.plot(yearly_registrations.index, yearly_registrations.values)
plt.xlabel('Year')
plt.ylabel('Number of Registrations')
plt.title('Number of customer registrations over time')
plt.show()
In [107]:
yearly_registrations
Out[107]:
RegistrationYear
2008       45
2009      349
2010     1171
2011     2765
2012     1980
2013    12009
2014     1097
2015      705
2016      533
2017     2956
2018      210
Name: CustomerID, dtype: int64

Customer registrations grew exponentially until 2013, but after 2013 registrations suddenly dropped; they rose again in 2017 and then fell. We need to find out why registrations dropped after 2013. Our earlier transaction and mail visualizations also showed that activity was good during 2013-2014 and dropped afterwards; the fall in registrations might be one of the potential reasons.
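To quantify how sharp the drop was, the year-over-year change can be computed directly from the `yearly_registrations` series; a minimal sketch using the counts printed above:

```python
import pandas as pd

# Yearly registration counts taken from the Out[107] table above
yearly_registrations = pd.Series(
    {2008: 45, 2009: 349, 2010: 1171, 2011: 2765, 2012: 1980,
     2013: 12009, 2014: 1097, 2015: 705, 2016: 533, 2017: 2956, 2018: 210},
    name='CustomerID')

# Year-over-year percentage change makes the 2014 collapse explicit
yoy = yearly_registrations.pct_change().mul(100).round(1)
print(yoy.loc[2014])  # registrations fell by roughly 90.9% from 2013 to 2014
```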

2)Checking correlation between a few of the numerical columns¶

In [108]:
plt.figure(figsize=(4,4))
sns.heatmap(train[['averageorders','average_time_between_mails',
  'average_time_between_purchase',                 
 'duration_of_customer_relation',
 'recency']].corr(),annot=True)
Out[108]:
<Axes: >
**1)average_time_between_mails has a positive correlation with duration_of_customer_relation.**
**2)recency has no meaningful correlation with any of the other columns.**
**3)average_time_between_purchase also has a positive correlation with duration_of_customer_relation.**

3)Email duration category with mail opened¶

In [109]:
import plotly.express as px
In [110]:
#lets compare email duration category with whether the mail was opened (yes or no)
fig = px.histogram(train, x="MailOpened", color="email_duration_category",barmode="group",  title="<b>email duration w.r.t mail opened</b>")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

points to be noted:-

1)On their last email, only 1382 customers with a very long email history opened the mail while 3578 did not.

2)Among customers who received only one mail, 880 did not even open it. Those who received only one mail got just the welcome mail, so we can infer that these 880 customers were not interested.

3)544 customers with a medium history opened the mail while 1336 did not.

4)4277 customers with a long mail history did not open the mail while 1749 did.

Keypoint: among customers who received only the welcome mail, 880 did not even open it, suggesting they were not interested.

b)OnlineCommunication category with days_since_first_transaction_category¶

In [111]:
#lets compare OnlineCommunication category with days_since_first_transaction_category 
fig = px.histogram(train, x="OnlineCommunication", color="days_since_first_transaction_category",barmode="group",  title="<b>OnlineCommunication category with days_since_first_transaction_category </b>")
fig.update_layout(width=700, height=300, bargap=0.1)
fig.show()

1)Only transaction: customers prefer online communication more.

2)Longtermhistory: fewer customers prefer online communication.

3)ModerateHistory: customers prefer online communication more.

4)RecentTransactions: customers preferred online communication more.

Keypoint: customers with a long-term history are less likely to prefer online communication.

c)days_since_first_transaction_category w.r.t RetentionPotential¶

In [112]:
#lets compare days_since_first_transaction_category with RetentionPotential
fig = px.histogram(train, x="days_since_first_transaction_category", color="RetentionPotential",barmode="group",  title="<b>days_since_first_transaction_category w.r.t RetentionPotential</b>")
fig.update_layout(width=700, height=300, bargap=0.1)
fig.show()

Customers with a moderate or long-term history include a few customers with medium retention potential. This is useful for our analysis, since our focus is on medium retention potential customers: by concentrating on them we can improve the business. As they have already been customers for a long time, we should plan something for them so that they remain loyal.

4)How do a customer's minimum and maximum order columns help us understand their purchase behaviour?¶

In [113]:
train[["minimumorderquantity","maximumorderquantity","averageorders"]].describe()
Out[113]:
minimumorderquantity maximumorderquantity averageorders
count 23820.000000 23820.000000 23820.000000
mean 58.586608 65.653778 61.931444
std 36.428571 41.270385 38.318825
min 0.000000 0.000000 0.000000
25% 40.000000 41.000000 41.000000
50% 48.000000 53.000000 51.000000
75% 69.000000 80.000000 74.000000
max 1501.000000 1501.000000 1501.000000
In [114]:
sns.scatterplot(x=train['minimumorderquantity'], y=train['maximumorderquantity'])
plt.xlabel('Minimum Order Quantity')
plt.ylabel('Maximum Order Quantity')
plt.title('Relationship between Minimum and Maximum Order Quantity')
plt.show()
#there is a linear relationship between minimum and maximum order quantity; as these columns are redundant we can consider removing them.

There is a linear relationship between these columns, so they will not add information during model building and we will not use them in our analysis. Since we also have average orders in our data, we will check against that column as well.

In [115]:
sns.scatterplot(x=train['minimumorderquantity'], y=train['averageorders'])
plt.xlabel('Minimum Order Quantity')
plt.ylabel('averageorders Quantity')
plt.title('Relationship between Minimum and averageorders Quantity')
plt.show()
#there is a linear relationship between minimum order quantity and average orders; as these columns are redundant we can consider removing them.

The same linear relationship holds here, so we will drop minimumorderquantity and maximumorderquantity and keep averageorders for our analysis.

In [116]:
# Identify most profitable customers
profitable_customers = train[train['averageorders'] > 100].reset_index()
# Optimize inventory management
large_order_customers = train[train['averageorders'] > 50]
small_order_customers = train[train['averageorders'] <= 50]
# Improve customer experience based on order quantity
train['customer_segment'] = train.apply(lambda row: 'large_order_customer' if row['OrderQuantity'] > 50 else 'small_order_customer', axis=1)
train['customer_segment']=train['customer_segment'].astype("category")
In [117]:
train['customer_segment']=train['customer_segment'].astype("category")

Segmenting customers into small-order and large-order customers based on their order quantities (stored in the customer_segment column).
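The row-wise `apply`/lambda used above can be replaced with a vectorized `np.where`; a minimal sketch on a hypothetical mini-frame standing in for `train`:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for `train`
df = pd.DataFrame({'OrderQuantity': [10, 51, 200, 50]})

# Vectorized equivalent of the apply/lambda segmentation above
df['customer_segment'] = np.where(df['OrderQuantity'] > 50,
                                  'large_order_customer',
                                  'small_order_customer')
print(df['customer_segment'].tolist())
# → ['small_order_customer', 'large_order_customer', 'large_order_customer', 'small_order_customer']
```

`np.where` evaluates the whole column at once instead of calling a Python function per row, which matters at 23820 rows.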

3b)Key visualizations with respect to target variable.¶

In [118]:
#Checking imbalance in data
# Finding % of customers with retention potential of high,low,medium.

target_high_percentage = (round (len(train.query('RetentionPotential=="High"'))/len(train),3)) * 100
print ("target_high_percentage:", target_high_percentage,"%")

target_Medium_percentage = (round (len(train.query('RetentionPotential=="Medium"'))/len(train),3)) * 100
print ("target_Medium_percentage:", target_Medium_percentage,"%")

target_Low_percentage = (round (len(train.query('RetentionPotential=="Low"'))/len(train),4)) * 100
print ("target_Low_percentage:", target_Low_percentage,"%")
target_high_percentage: 15.8 %
target_Medium_percentage: 3.6999999999999997 %
target_Low_percentage: 80.54 %

As we can see, there is a high imbalance in the data: the Low class makes up a very large proportion, while the Medium and High classes are very small.
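The float artifact in the Medium percentage above (3.6999…) comes from rounding before multiplying by 100; `value_counts(normalize=True)` gives the same breakdown in one line if we round after scaling. A sketch on a hypothetical stand-in series:

```python
import pandas as pd

# Hypothetical stand-in for train['RetentionPotential']
rp = pd.Series(['Low'] * 80 + ['High'] * 16 + ['Medium'] * 4)

# Round after multiplying by 100 to avoid binary-float artifacts
pct = rp.value_counts(normalize=True).mul(100).round(2)
print(pct.to_dict())  # → {'Low': 80.0, 'High': 16.0, 'Medium': 4.0}
```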

Creation of three data sets - one each for high, medium and low retention potential

In [119]:
# Creating Dataframe of the retention potenial with high customers  
high= train.query('RetentionPotential=="High"')
# Creating Dataframe of the retention potenial with low customers 
low= train.query('RetentionPotential=="Low"')
# Creating Dataframe of the retention potenial with medium customers 
medium= train.query('RetentionPotential=="Medium"')
In [120]:
catcolanalysis=['OnlineCommunication','AutomaticRefill','DoorstepDelivery','MailOpened','MailClicked','days_since_first_transaction_category','email_duration_category','customer_segment']
In [121]:
#loop for performing univariate analysis
for i in catcolanalysis:
    plt.figure(figsize=(10,5))
    plt.subplot(1,3,1)
    high[i].value_counts(normalize=True).plot.bar()
    plt.title(i+ '-high')
    plt.subplot(1,3,2)
    low[i].value_counts(normalize=True).plot.bar()
    plt.title(i+ '- low')
    plt.subplot(1,3,3)
    medium[i].value_counts(normalize=True).plot.bar()
    plt.title(i+ ' - medium')

Observations from the above plots:-

  1. Online communication - Medium retention potential has a smaller percentage of customers preferring online communication.
  2. Automatic refill - High retention potential customers did not opt for automatic refill at all.
  3. Doorstep delivery - customers across all targets show the same preference; most of them want doorstep delivery.
  4. Mail opened - Low retention potential has a comparatively higher percentage of customers not opening the mail.
  5. Mail clicked - Medium retention potential customers clicked the mail more than the others.
  6. days_since_first_transaction_category: a)Only transaction: customers with a higher chance of retention are more numerous. b)Moderate history: customers with a medium chance of retention are more numerous. c)Long-term history: customers with a medium chance of retention are more numerous. d)Recent transaction: customers are nearly equal across all retention potential categories.
  7. email_duration_category: a)Only email: High retention potential has more customers with only one email sent. b)Long mail history: Low retention potential has more customers. c)Very long mail history: Low retention potential has more customers. d)Medium history: roughly the same across all targets.
  8. customer_segment: large-order customers outnumber small-order customers among medium retention potential customers.

Recency measures the time elapsed since the customer's last transaction; the resulting values quantify how long it has been since each customer last transacted.

In [122]:
#lets check transaction category w.r.t how recent a customer has been, segregating by retention potential via hue
sns.scatterplot(x=train["days_since_first_transaction_category"],y=train["recency"],hue=train["RetentionPotential"])
Out[122]:
<Axes: xlabel='days_since_first_transaction_category', ylabel='recency'>

1)Among customers with a long-term or moderate history whose most recent transaction was 0-500 days ago, there are a few customers whose retention potential is medium.

2)Customers with a moderate history and recency between 2500 and 3000 days also include a few customers with a medium possibility of retention.

In [123]:
# distribution of customer recency for Medium vs High retention potential
ax = sns.kdeplot(train.recency[(train["RetentionPotential"] == 'Medium') ],
                color="Gold", shade = True);
ax = sns.kdeplot(train.recency[(train["RetentionPotential"] == 'High') ],
                ax =ax, color="Green", shade= True);

# legend labels must follow plotting order: Medium was plotted first
ax.legend(["Medium","High"],loc='upper right');
ax.set_ylabel('Density');
ax.set_xlabel('customer recency');
ax.set_title('Distribution of customer recency by retention potenial');

The KDE shows the relative frequency of observations in each range of customer recency.

1)0-1000 days: there are more High retention potential customers in this range.

2)1000-4000 days: these customers are not very active but can still be retained; Medium customers dominate this range, so there is a chance of them churning.

a)Order quantities by registration every year¶

In [124]:
for i in range(2008,2019,2):
   t=train[train["RegistrationYear"]==i]
   fig = px.pie(t, values='OrderQuantity', names='RetentionPotential', title=f'order quantity every year w.r.t to registration year {i}')
   fig.show()

Here we compare order quantities by registration year, sampled every 2 years. Year 2010 had the highest order quantity for medium and high retention customers. An interesting point is that the customers most likely to churn had the highest order quantity share in 2018.

b)Interaction of customers with the company: identifying which services are most popular and how to capitalize on them based on customer preferences.¶

In [216]:
train.columns
Out[216]:
Index(['CustomerID', 'City', 'OnlineCommunication', 'AutomaticRefill',
       'DoorstepDelivery', 'PreferredDeliveryDay', 'RetentionPotential',
       'EmailType', 'MailOpened', 'MailClicked', 'numofemails',
       'average_time_between_mails', 'OrderQuantity', 'num_transactions',
       'sumoforderquantity', 'average_time_between_purchase',
       'days_since_first_transaction', 'email_duration',
       'duration_of_customer_relation', 'recency',
       'days_since_first_transaction_category', 'email_duration_category',
       'RegistrationYear', 'customer_segment', 'Time_to_First_Email',
       'Time_since_Last_Email', 'most_ordered_quarter', 'Registration_month',
       'Registration_weekday', 'lastemail_month', 'lastemail_weekday',
       'transaction_month', 'transaction_weekday', 'open_rate',
       'click_through_rate', 'response_rate', 'open_rate_ratio',
       'click_through_rate_ratio', 'response_rate_ratio'],
      dtype='object')
In [217]:
# create a cross-tabulation table to count the number of cases
ct = pd.crosstab(index=medium['DoorstepDelivery'],
                 columns=[medium['AutomaticRefill'],medium['RetentionPotential']])

# create the heatmap with annotations showing the counts
plt.figure(figsize=(5, 5))
sns.heatmap(ct, annot=True, fmt='d')
plt.show()

There are 740 customers who preferred neither doorstep delivery nor automatic refill. There might be some underlying factors here, so the company should look into improving these services in the near future.

In [126]:
# create a cross-tabulation table to count the number of cases
ct = pd.crosstab(index=medium['days_since_first_transaction_category'],
                 columns=[medium['email_duration_category'],medium['RetentionPotential']])

# create the heatmap with annotations showing the counts
plt.figure(figsize=(10, 5))
sns.heatmap(ct, annot=True, fmt='d')
plt.show()

There are 355 medium retention potential customers with a long-term transaction history who were sent only a single mail.

Customers with medium retention potential and a long-term transaction history are comparatively more numerous than the others, so we can target these customers and prevent them from churning.

4)Feature Engineering.¶

Columns such as Time_to_First_Email and Time_since_Last_Email might be useful: they tell us whether a mail was sent to a customer before or after their first transaction, and how long after the last email the last transaction occurred.
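A minimal sketch of the intended computation on synthetic datetime columns (the column names mirror those used in the cell that follows; that cell fails with a KeyError because first_transaction_date is not present in train at this point):

```python
import pandas as pd

# Hypothetical mini-frame with the datetime columns the feature needs
df = pd.DataFrame({
    'first_transaction_date': pd.to_datetime(['2013-03-10', '2014-01-05']),
    'firstemailDatetime':     pd.to_datetime(['2013-03-01', '2014-02-01']),
})

# Days from first email to first transaction; a negative value means the
# transaction happened before any email was sent, so clip at zero
df['Time_to_First_Email'] = (df['first_transaction_date']
                             - df['firstemailDatetime']).dt.days.clip(lower=0)
print(df['Time_to_First_Email'].tolist())  # → [9, 0]
```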

In [220]:
train["Time_to_First_Email"]=(train["first_transaction_date"]-train["firstemailDatetime"]).dt.days
train["Time_to_First_Email"]=train["Time_to_First_Email"].astype("int")
train["Time_since_Last_Email"]=(train["last_transaction_date"]-train["lastemailDatetime"]).dt.days
train["Time_since_Last_Email"]=train["Time_since_Last_Email"].astype("int")
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3801             try:
-> 3802                 return self._engine.get_loc(casted_key)
   3803             except KeyError as err:

/usr/local/lib/python3.9/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

/usr/local/lib/python3.9/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'first_transaction_date'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-220-22291da778a3> in <cell line: 1>()
----> 1 train["Time_to_First_Email"]=(train["first_transaction_date"]-train["firstemailDatetime"]).dt.days
      2 train["Time_to_First_Email"]=train["Time_to_First_Email"].astype("int")
      3 train["Time_to_First_Email"] = np.where(train["Time_to_First_Email"] < 0, 0, train["Time_to_First_Email"])
      4 train["Time_since_Last_Email"]=(train["last_transaction_date"]-train["lastemailDatetime"]).dt.days
      5 train["Time_since_Last_Email"]=train["Time_since_Last_Email"].astype("int")

/usr/local/lib/python3.9/dist-packages/pandas/core/frame.py in __getitem__(self, key)
   3805             if self.columns.nlevels > 1:
   3806                 return self._getitem_multilevel(key)
-> 3807             indexer = self.columns.get_loc(key)
   3808             if is_integer(indexer):
   3809                 indexer = [indexer]

/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3802                 return self._engine.get_loc(casted_key)
   3803             except KeyError as err:
-> 3804                 raise KeyError(key) from err
   3805             except TypeError:
   3806                 # If we have a listlike key, _check_indexing_error will raise

KeyError: 'first_transaction_date'
In [221]:
train["Time_to_First_Email"] = np.where(train["Time_to_First_Email"] < 0, 0, train["Time_to_First_Email"])

5)Customer-segmentation¶

RFM Segmentation: This approach segments customers based on their recency, frequency, and monetary value of their transactions. You already have some features related to transaction history, such as last_transaction_date, num_transactions, and OrderQuantity, which could be used to calculate these metrics. RFM segmentation can be helpful for identifying high-value customers, loyal customers, and customers who may be at risk of churn.

Behavioral Segmentation: This approach segments customers based on their behaviors and preferences, such as their preferred delivery day, online communication preferences, and response to email campaigns (MailOpened, MailClicked). You could also use features related to customer retention potential, such as RetentionPotential, to identify customer segments that are likely to require special attention or retention strategies.

Hybrid Segmentation: You could also combine multiple segmentation approaches to create a more comprehensive view of customer behavior and preferences. For example, you could combine RFM and Behavioral segmentation to identify high-value customers who are also responsive to email campaigns, or combine Behavioral and Demographic segmentation to identify customer segments with similar behavior and demographic characteristics.

1)RFM Segmentation¶

In [130]:
#RFM segmentation
#lets segment our customers based upon their transaction records
#if we observe, our last transaction was on 2018-01-21
transaction["DateOfOrder"].describe()
Out[130]:
count                  104755
unique                   3151
top       2014-01-13 00:00:00
freq                      310
first     2008-07-29 00:00:00
last      2018-01-21 00:00:00
Name: DateOfOrder, dtype: object
In [131]:
#compute days since each order relative to the last observed order date (2018-01-21)
transaction['days_since'] =  pd.to_datetime(transaction['DateOfOrder']).apply(lambda x: (pd.to_datetime("2018-01-21") - x).days)
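The `apply`/lambda above can be written as a single vectorized subtraction; a sketch on a hypothetical stand-in for the transaction frame:

```python
import pandas as pd

# Hypothetical stand-in for the transaction frame
tx = pd.DataFrame({'DateOfOrder': ['2018-01-21', '2018-01-11', '2008-07-29']})

# Vectorized: subtract every order date from the last observed date at once
tx['days_since'] = (pd.Timestamp('2018-01-21')
                    - pd.to_datetime(tx['DateOfOrder'])).dt.days
print(tx['days_since'].tolist())  # → [0, 10, 3463]
```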
In [132]:
## Compute recency, frequency, and monetary value per customer
def rec_freq_value(x):
    recency        = x['days_since'].min()   #how recently the customer was last active
    first_purchase = x['days_since'].max()   #the largest day gap corresponds to the first purchase
    frequency      = x['days_since'].count() #number of transactions between first and last purchase
    order_value    = x['OrderQuantity'].mean() #average quantity ordered per transaction
    c = ['recency', 'first_purchase', 'frequency', 'order_value']
    return pd.Series([recency, first_purchase, frequency, order_value], index = c)
In [133]:
#applying the function on my whole records
customer_purchases = transaction.groupby('CustomerID').apply(rec_freq_value)

KMeans clustering partitions n observations into k clusters, with each observation belonging to the cluster with the nearest mean (cluster center or centroid).

In [134]:
#customer_purchases
In [135]:
from sklearn.preprocessing import StandardScaler

#Bring the data on same scale
scaleobj = StandardScaler()
Scaled_Data = scaleobj.fit_transform(customer_purchases)

#Transform it back to dataframe
Scaled_Data = pd.DataFrame(Scaled_Data, index = customer_purchases.index, columns = customer_purchases.columns)

Why did we scale the data here? Scaling is an essential step in clustering, as we have to bring all the columns onto the same scale before computing distances.

In [136]:
#Scaled_Data
In [137]:
from sklearn.cluster import KMeans

sum_of_sq_dist = {}
for k in range(1,15):
    km = KMeans(n_clusters= k, init= 'k-means++', max_iter= 1000)
    km = km.fit(Scaled_Data)
    sum_of_sq_dist[k] = km.inertia_
    
#Plot the graph for the sum of square distance values and Number of Clusters
sns.pointplot(x = list(sum_of_sq_dist.keys()), y = list(sum_of_sq_dist.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()

At k=5 the decrease in the sum of squared distances starts to flatten, so we pick k=5 as the optimal number of clusters.
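As a cross-check on the elbow heuristic, the silhouette score can be computed for each candidate k; a sketch on synthetic well-separated blobs (substitute `Scaled_Data` in practice):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 5 well-separated groups stands in for Scaled_Data
centers = [(-8, -8), (-8, 8), (8, -8), (8, 8), (0, 0)]
X, _ = make_blobs(n_samples=500, centers=centers, cluster_std=0.6, random_state=42)

# Higher silhouette = tighter, better-separated clusters
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, init='k-means++', n_init=10,
                    random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))

best_k = max(scores, key=scores.get)
print('best k:', best_k)
```

Unlike inertia, the silhouette score does not decrease monotonically with k, so its maximum points directly at a good cluster count.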

In [138]:
#Perform K-Mean Clustering or build the K-Means clustering model
KMean_clust = KMeans(n_clusters= 5, init= 'k-means++', max_iter= 1000)
KMean_clust.fit(Scaled_Data)

#Find the clusters for the observation given in the dataset
customer_purchases['Cluster'] = KMean_clust.labels_
customer_purchases.head()
Out[138]:
recency first_purchase frequency order_value Cluster
CustomerID
C12116 1564.0 1571.0 3.0 38.333333 1
C12117 1508.0 1508.0 1.0 41.000000 1
C12118 1594.0 1594.0 1.0 41.000000 1
C12119 1552.0 1552.0 1.0 52.000000 1
C12120 1472.0 1472.0 1.0 31.000000 1
In [139]:
#customer_purchases['Cluster'].value_counts()
In [140]:
from matplotlib import pyplot as plt

##Scatter Plot Frequency Vs Recency
# DataFrame.plot opens its own figure, so no separate plt.figure is needed
Colors = ["red", "green", "blue","black","yellow"]
customer_purchases['Color'] = customer_purchases['Cluster'].map(lambda p: Colors[p])
ax = customer_purchases.plot(    
    kind="scatter", 
    x="recency", y="frequency",
    figsize=(10,8),
    c = customer_purchases['Color']
)
In [141]:
customer_purchases['Color'].value_counts()
Out[141]:
green     13103
red        4170
black      3979
blue       2161
yellow      407
Name: Color, dtype: int64
In [142]:
customer_purchases['Cluster'].value_counts()
Out[142]:
1    13103
0     4170
3     3979
2     2161
4      407
Name: Cluster, dtype: int64
In [143]:
# map cluster ids to names directly (avoids chained-assignment warnings)
cluster_names = {0: 'lessfrequent-midrecency',
                 1: 'higherfrequency',
                 2: 'lessfrequent,lessrecency',
                 3: 'lessfrequent-higherrecency',
                 4: 'lesserfrequent'}
customer_purchases['Cluster Name'] = customer_purchases['Cluster'].map(cluster_names)
In [144]:
#customer_purchases

These are the categories we obtained from RFM segmentation, but we cannot rely on them alone, so let us check RFM segmentation along with customer email behaviour.

2)RFM SEGMENTATION WITH CUSTOMER EMAIL BEHAVIOUR¶

In [145]:
RFM=transaction.groupby('CustomerID').apply(rec_freq_value)
In [146]:
RFM
Out[146]:
recency first_purchase frequency order_value
CustomerID
C12116 1564.0 1571.0 3.0 38.333333
C12117 1508.0 1508.0 1.0 41.000000
C12118 1594.0 1594.0 1.0 41.000000
C12119 1552.0 1552.0 1.0 52.000000
C12120 1472.0 1472.0 1.0 31.000000
... ... ... ... ...
C42185 2210.0 2484.0 13.0 103.076923
C42186 1520.0 1520.0 1.0 46.000000
C42187 1589.0 1589.0 1.0 41.000000
C42188 2301.0 2301.0 1.0 51.000000
C42189 1528.0 1718.0 2.0 90.000000

23820 rows × 4 columns

In [147]:
#lets combine email behaviour with rfm 
emailresponses=train[['CustomerID','Totalmails-notoponed', 'Totalmails-oponed', 'timesclicked-yes', 'timesclicked-no','numofemails']]
In [148]:
#rfm with email responsive
import pandas as pd
from sklearn.cluster import KMeans
import numpy as np

# Combine RFM scores and email response data
rfm_email = pd.merge(RFM,emailresponses, on=["CustomerID"])
rfm_email=rfm_email.drop(["CustomerID"],axis=1)

# Normalize the RFM scores and email response data
rfm_email_norm = (rfm_email - rfm_email.mean()) / rfm_email.std()

sum_of_sq_dist = {}
for k in range(1,15):
    km = KMeans(n_clusters= k, init= 'k-means++', max_iter= 1000)
    km = km.fit(rfm_email_norm)
    sum_of_sq_dist[k] = km.inertia_
    
#Plot the graph for the sum of square distance values and Number of Clusters
sns.pointplot(x = list(sum_of_sq_dist.keys()), y = list(sum_of_sq_dist.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()
In [149]:
#Perform K-Mean Clustering or build the K-Means clustering model
KMean_clust = KMeans(n_clusters= 5, init= 'k-means++', max_iter= 1000)
KMean_clust.fit(rfm_email_norm)

#Find the clusters for the observation given in the dataset
rfm_email['Cluster'] = KMean_clust.labels_
rfm_email.head()
Out[149]:
recency first_purchase frequency order_value Totalmails-notoponed Totalmails-oponed timesclicked-yes timesclicked-no numofemails Cluster
0 1564.0 1571.0 3.0 38.333333 27 5 29 2 31 1
1 1508.0 1508.0 1.0 41.000000 0 1 0 0 1 4
2 1594.0 1594.0 1.0 41.000000 0 1 0 0 1 4
3 1552.0 1552.0 1.0 52.000000 28 14 37 4 41 3
4 1472.0 1472.0 1.0 31.000000 0 1 0 0 1 4
In [150]:
from matplotlib import pyplot as plt

##Scatter Plot Frequency Vs Recency
# DataFrame.plot opens its own figure, so no separate plt.figure is needed
Colors = ["red", "green", "blue","black","yellow"]
rfm_email['Color'] = rfm_email['Cluster'].map(lambda p: Colors[p])
ax = rfm_email.plot(    
    kind="scatter", 
    x="recency", y="frequency",
    figsize=(10,8),
    c = rfm_email['Color']
)
In [151]:
ax = rfm_email.plot(    
    kind="scatter", 
    x="order_value", y="frequency",
    figsize=(10,8),
    c = rfm_email['Color']
)
In [152]:
# map cluster ids to names directly (avoids chained-assignment warnings)
cluster_names = {0: 'midfrequency',
                 1: 'Higherrecency-lessfrequent',
                 2: 'recenttransactionswithlessfrequency',
                 3: 'midfrequencywithrecenttransactionatintervals',
                 4: 'higherfrequencyatdifferentintervals'}
rfm_email['Cluster Name'] = rfm_email['Cluster'].map(cluster_names)

Assigning the segments obtained from the clustering analysis back to the main dataframe.

In [153]:
train = train.reset_index().drop('index', axis=1)
train['Cluster Name'] = rfm_email['Cluster Name']
train['Cluster Name']=train['Cluster Name'].astype("category")
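The positional assignment above relies on `train` and `rfm_email` sharing row order; mapping through CustomerID is safer because it aligns on the key rather than on position. A sketch on hypothetical mini-frames:

```python
import pandas as pd

# Hypothetical per-customer cluster labels, indexed by CustomerID
labels_by_id = pd.Series({'C1': 'higherfrequency', 'C2': 'midfrequency'})

# Rows may appear in any order; map aligns on the key, not on position
demo = pd.DataFrame({'CustomerID': ['C2', 'C1']})
demo['Cluster Name'] = demo['CustomerID'].map(labels_by_id).astype('category')
print(demo['Cluster Name'].tolist())  # → ['midfrequency', 'higherfrequency']
```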

Let us now try building different models for customer retention potential.

In [154]:
## Custom Function for Bar Plots

def barplot(column,df):
    bar_plot1 = sns.countplot(x=column, data=df)
    
    total = len(df[column])
    for p in bar_plot1.patches:
        percentage = '{:.2f}%'.format(100 * p.get_height()/total)
        height = p.get_height()
        bar_plot1.text(p.get_x()+ p.get_width()/2, height + 400, percentage, ha="center")
In [155]:
train[['totalorders','Totalorderquantity_in_first_quarter',
       'Totalorderquantity_in_second_quarter',
       'Totalorderquantity_in_Third_quarter']].describe()
Out[155]:
totalorders Totalorderquantity_in_first_quarter Totalorderquantity_in_second_quarter Totalorderquantity_in_Third_quarter
count 23820.000000 23820.000000 23820.000000 23820.000000
mean 357.622922 96.159446 109.537280 151.926196
std 1335.649680 417.061493 472.822847 495.469196
min 0.000000 0.000000 0.000000 0.000000
25% 41.000000 0.000000 0.000000 0.000000
50% 65.000000 0.000000 0.000000 41.000000
75% 169.000000 41.000000 52.000000 93.000000
max 51429.000000 13982.000000 20228.000000 17219.000000
In [156]:
sns.heatmap(train[['totalorders','Totalorderquantity_in_first_quarter',
       'Totalorderquantity_in_second_quarter',
       'Totalorderquantity_in_Third_quarter']].corr(),annot=True)
Out[156]:
<Axes: >
In [157]:
# Loop over the rows of the dataframe and create a new column 'most_ordered_quarter'
for index, row in train.iterrows():
    if row['Totalorderquantity_in_first_quarter'] > row['Totalorderquantity_in_second_quarter'] and row['Totalorderquantity_in_first_quarter'] > row['Totalorderquantity_in_Third_quarter']:
        train.at[index, 'most_ordered_quarter'] = 'Q1'
    elif row['Totalorderquantity_in_second_quarter'] > row['Totalorderquantity_in_first_quarter'] and row['Totalorderquantity_in_second_quarter'] > row['Totalorderquantity_in_Third_quarter']:
        train.at[index, 'most_ordered_quarter'] = 'Q2'
    else:
        train.at[index, 'most_ordered_quarter'] = 'Q3'
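The `iterrows` loop above scales poorly; `idxmax` over the three quarter columns gives the same label in one vectorized pass (note one edge case: on ties the loop falls through to 'Q3', while `idxmax` keeps the first maximum). A sketch on a hypothetical mini-frame:

```python
import pandas as pd

quarter_cols = ['Totalorderquantity_in_first_quarter',
                'Totalorderquantity_in_second_quarter',
                'Totalorderquantity_in_Third_quarter']

# Hypothetical mini-frame standing in for `train`
df = pd.DataFrame([[10, 5, 3], [1, 9, 2], [0, 0, 7]], columns=quarter_cols)

# Column holding the row-wise maximum, relabelled Q1/Q2/Q3
labels = dict(zip(quarter_cols, ['Q1', 'Q2', 'Q3']))
df['most_ordered_quarter'] = df[quarter_cols].idxmax(axis=1).map(labels)
print(df['most_ordered_quarter'].tolist())  # → ['Q1', 'Q2', 'Q3']
```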
In [158]:
train['most_ordered_quarter']=train['most_ordered_quarter'].astype("category")
In [159]:
barplot('most_ordered_quarter',train)
In [160]:
cat_cols = train.select_dtypes('category').columns
cat_cols
Out[160]:
Index(['CustomerID', 'City', 'OnlineCommunication', 'AutomaticRefill',
       'DoorstepDelivery', 'PreferredDeliveryDay', 'RetentionPotential',
       'EmailType', 'MailOpened', 'MailClicked',
       'days_since_first_transaction_category', 'email_duration_category',
       'customer_segment', 'Cluster Name', 'most_ordered_quarter'],
      dtype='object')

6)Feature Selection¶

In [161]:
def find_chi2_independence(cat_col, target, df, alpha=0.05):
    data = df[[target, cat_col]]
    tab = sm.stats.Table.from_data(data)
    tab = tab.table_orig.to_numpy()
    print(f"---------------{target} Vs {cat_col} Chi Square Test of Independence -------------------")
    print(f"\n Contingency table :\n")
    print(tab)
    
    stat, p, dof, expected = chi2_contingency(tab)
    print(f"\n Expected table :\n")
    print(expected)
    
    print(f"The p value returned = {p} and degrees of freedom returned = {dof}")
    
    # interpret p-value
    print('significance(alpha) = %.3f' % (alpha))

    if p <= alpha:
        print('Dependent (reject H0)')
    else:
        print('Independent (fail to reject H0)') 
In [162]:
for col in cat_cols:
    find_chi2_independence(col, 'RetentionPotential', train)
---------------RetentionPotential Vs CustomerID Chi Square Test of Independence -------------------

 Contingency table :

[[0 0 1 ... 1 0 0]
 [1 1 0 ... 0 1 1]
 [0 0 0 ... 0 0 0]]

 Expected table :

[[0.15768262 0.15768262 0.15768262 ... 0.15768262 0.15768262 0.15768262]
 [0.80541562 0.80541562 0.80541562 ... 0.80541562 0.80541562 0.80541562]
 [0.03690176 0.03690176 0.03690176 ... 0.03690176 0.03690176 0.03690176]]
The p value returned = 0.4965535116908812 and degrees of freedom returned = 47638
significance(alpha) = 0.050
Independent (fail to reject H0)
---------------RetentionPotential Vs City Chi Square Test of Independence -------------------

 Contingency table :

[[ 125 1257  866 1508]
 [1155 5168 5503 7359]
 [  28  454  121  276]]

 Expected table :

[[ 206.2488665  1084.69874055 1023.36020151 1441.69219144]
 [1053.4836272  5540.45403023 5227.14735516 7363.91498741]
 [  48.2675063   253.84722922  239.49244332  337.39282116]]
The p value returned = 2.7318851965917284e-77 and degrees of freedom returned = 6
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs OnlineCommunication Chi Square Test of Independence -------------------

 Contingency table :

[[ 1893  1863]
 [ 5740 13445]
 [  456   423]]

 Expected table :

[[ 1275.49471033  2480.50528967]
 [ 6515.00692695 12669.99307305]
 [  298.49836272   580.50163728]]
The p value returned = 1.1575020231150796e-156 and degrees of freedom returned = 2
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs AutomaticRefill Chi Square Test of Independence -------------------

 Contingency table :

[[ 3756     0]
 [17048  2137]
 [  760   119]]

 Expected table :

[[ 3400.26801008   355.73198992]
 [17367.98236776  1817.01763224]
 [  795.74962217    83.25037783]]
The p value returned = 2.9763053132377616e-103 and degrees of freedom returned = 2
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs DoorstepDelivery Chi Square Test of Independence -------------------

 Contingency table :

[[ 3737    19]
 [18340   845]
 [  847    32]]

 Expected table :

[[ 3614.7163728    141.2836272 ]
 [18463.34760705   721.65239295]
 [  845.93602015    33.06397985]]
The p value returned = 2.2611134234358053e-29 and degrees of freedom returned = 2
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs PreferredDeliveryDay Chi Square Test of Independence -------------------

 Contingency table :

[[ 654  757  675  854  636  180]
 [3387 4328 3232 4273 2990  975]
 [ 126  215  136  216  152   34]]

 Expected table :

[[ 657.06347607  835.71788413  637.51083123  842.49823678  595.72493703
   187.48463476]
 [3356.16687657 4268.70277078 3256.29534005 4303.33564232 3042.86020151
   957.63916877]
 [ 153.76964736  195.57934509  149.19382872  197.16612091  139.41486146
    43.87619647]]
The p value returned = 0.001334597616908662 and degrees of freedom returned = 10
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs RetentionPotential Chi Square Test of Independence -------------------

 Contingency table :

[[ 3756     0     0]
 [    0 19185     0]
 [    0     0   879]]

 Expected table :

[[  592.2559194   3025.14105793   138.60302267]
 [ 3025.14105793 15451.89861461   707.96032746]
 [  138.60302267   707.96032746    32.43664987]]
The p value returned = 0.0 and degrees of freedom returned = 4
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs EmailType Chi Square Test of Independence -------------------

 Contingency table :

[[  859  2897]
 [12908  6277]
 [  344   535]]

 Expected table :

[[ 2225.05944584  1530.94055416]
 [11365.2197733   7819.7802267 ]
 [  520.72078086   358.27921914]]
The p value returned = 0.0 and degrees of freedom returned = 2
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs MailOpened Chi Square Test of Independence -------------------

 Contingency table :

[[ 495 3261]
 [9403 9782]
 [ 173  706]]

 Expected table :

[[ 1588.02166247  2167.97833753]
 [ 8111.3406801  11073.6593199 ]
 [  371.63765743   507.36234257]]
The p value returned = 0.0 and degrees of freedom returned = 2
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs MailClicked Chi Square Test of Independence -------------------

 Contingency table :

[[ 2897   786    73]
 [ 6277 12138   770]
 [  535   267    77]]

 Expected table :

[[ 1530.94055416  2079.99143577   145.06801008]
 [ 7819.7802267  10624.23740554   740.98236776]
 [  358.27921914   486.77115869    33.94962217]]
The p value returned = 0.0 and degrees of freedom returned = 4
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs days_since_first_transaction_category Chi Square Test of Independence -------------------

 Contingency table :

[[  188   429  3033   106]
 [ 3399  3331 11880   575]
 [  524   229   100    26]]

 Expected table :

[[  648.23324937   628.99596977  2367.28916877   111.48161209]
 [ 3311.06360202  3212.80289673 12091.70465995   569.42884131]
 [  151.70314861   147.2011335    554.00617128    26.0895466 ]]
The p value returned = 0.0 and degrees of freedom returned = 6
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs email_duration_category Chi Square Test of Independence -------------------

 Contingency table :

[[ 192 3116  354   94]
 [1669 7251 5569 4696]
 [  19  587  103  170]]

 Expected table :

[[ 296.44332494 1727.25541562  950.19546599  782.10579345]
 [1514.1813602  8822.52267003 4853.43450882 3994.86146096]
 [  69.37531486  404.22191436  222.37002519  183.03274559]]
The p value returned = 0.0 and degrees of freedom returned = 6
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs customer_segment Chi Square Test of Independence -------------------

 Contingency table :

[[1791 1965]
 [9916 9269]
 [ 637  242]]

 Expected table :

[[1946.43425693 1809.56574307]
 [9942.05037783 9242.94962217]
 [ 455.51536524  423.48463476]]
The p value returned = 6.094259288968507e-39 and degrees of freedom returned = 2
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs Cluster Name Chi Square Test of Independence -------------------

 Contingency table :

[[1082 1562  558  486   68]
 [5477 8004 2797 2620  287]
 [ 242  384  131  114    8]]

 Expected table :

[[1072.39949622 1568.94206549  549.68161209  507.73803526   57.23879093]
 [5477.63161209 8013.88539043 2807.67884131 2593.43828715  292.36586902]
 [ 250.96889169  367.17254408  128.6395466   118.82367758   13.39534005]]
The p value returned = 0.5233392144502882 and degrees of freedom returned = 8
significance(alpha) = 0.050
Independent (fail to reject H0)
---------------RetentionPotential Vs most_ordered_quarter Chi Square Test of Independence -------------------

 Contingency table :

[[  547  1048  2161]
 [ 3289  3703 12193]
 [  208   254   417]]

 Expected table :

[[  637.66851385   789.20151134  2329.12997481]
 [ 3257.10075567  4031.10516373 11896.7940806 ]
 [  149.23073048   184.69332494   545.07594458]]
The p value returned = 3.2547730739300043e-47 and degrees of freedom returned = 4
significance(alpha) = 0.050
Dependent (reject H0)

From the chi-square analysis we can conclude that the column 'Cluster Name' should be removed: its test against RetentionPotential fails to reject H0, so it is independent of the target and adds no predictive value. (CustomerID is likewise independent, as expected for a unique identifier; the RetentionPotential-vs-itself test is trivially dependent and can be ignored.)
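Instead of scanning the printed output for each column, the per-column decisions can be collected into a summary frame. A sketch on hypothetical toy data (the real column names and labels differ):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data standing in for train
df = pd.DataFrame({
    'target':  ['low', 'medium', 'high'] * 40,
    'feature': ['a', 'b'] * 60,
})

results = []
for col in ['feature']:
    tab = pd.crosstab(df['target'], df[col])
    stat, p, dof, _ = chi2_contingency(tab)
    results.append({'column': col, 'p_value': p,
                    'decision': 'dependent' if p <= 0.05 else 'independent'})

summary = pd.DataFrame(results)
```

Here `feature` is perfectly balanced across targets, so the test reports independence; sorting `summary` by `p_value` makes the drop candidates obvious.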

In [163]:
#Cluster Name - removing this column as it adds no benefit to our analysis
In [164]:
train=train.drop(['Cluster Name'],axis=1)

FEATURE ENGINEERING

In [166]:
train['Registration_month']=train['DateOfRegistration'].dt.month
train['Registration_weekday']=train['DateOfRegistration'].dt.weekday
train['lastemail_month']=train['lastemailDatetime'].dt.month
train['lastemail_weekday']=train['lastemailDatetime'].dt.weekday
train['transaction_month']=train['last_transaction_date'].dt.month
train['transaction_weekday']=train['last_transaction_date'].dt.weekday
train=train.drop(["last_transaction_date","lastemailDatetime","DateOfRegistration"],axis=1)
In [167]:
datetypecolumns
train=train.drop(['firstemailDatetime',"first_transaction_date","emailDatetime"],axis=1)
In [168]:
for col in ['Registration_month',
       'Registration_weekday', 'lastemail_month', 'lastemail_weekday',
       'transaction_month', 'transaction_weekday']:
   train[col] =train[col].astype('category')
In [169]:
barplot('Registration_month',train)
In [170]:
barplot('Registration_weekday',train)
In [171]:
barplot('lastemail_month',train)
In [172]:
barplot('lastemail_weekday',train)
In [173]:
barplot('transaction_month',train)
In [174]:
barplot('transaction_weekday',train)
In [175]:
barplot('RegistrationYear',train)

Highlights from the plots above:

1) When fewer mails are sent, fewer people engage in transactions, so engagement with customers is crucial.

2) Customer acquisition was highest in 2011, 2013 and 2017.

3) There are more transactions on weekends than on weekdays, with Thursday and Friday being the lowest.

4) Looking at transactions by month, there is strong growth from the start of winter through the end of January, and registrations also peak in this period, so engaging customers during these months is especially important.

5) Registrations and transactions are lowest during the summer months.

In [176]:
train["RegistrationYear"]=train["RegistrationYear"].astype("category")
In [177]:
#feature elimination for numerical columns
train=train.drop(['Totalorderquantity_in_first_quarter',
       'Totalorderquantity_in_second_quarter',
       'Totalorderquantity_in_Third_quarter','minimumorderquantity', 'maximumorderquantity','averageorders'],axis=1)
In [178]:
num_cols=train.select_dtypes(include=['int64', 'float64'])
plt.figure(figsize=(10,12))
sns.heatmap(num_cols.corr(),annot=True)
Out[178]:
<Axes: >

If we look at the correlations involving the number of emails, we can see that total mails not opened, total mails opened, and times clicked (yes/no) all correlate strongly with the number of mails. Rather than keeping these collinear raw counts, we can combine them into ratio features.

In [179]:
train['open_rate'] = np.log(train['Totalmails-oponed'] / train['numofemails'])
train['click_through_rate'] = np.log(train['totalorders'] / train['Totalmails-oponed'])
train['response_rate'] = np.log(train['totalorders'] / train['numofemails'])
train['open_rate_ratio'] =np.log(train['Totalmails-oponed'] / train['Totalmails-notoponed'])
train['click_through_rate_ratio'] = np.log(train['timesclicked-yes'] / train['timesclicked-no'])
train['response_rate_ratio'] = np.log(train['totalorders']) / (train['numofemails'] - train['totalorders'])
In [180]:
train[['open_rate',
       'click_through_rate', 'response_rate', 'open_rate_ratio',
       'click_through_rate_ratio', 'response_rate_ratio']].describe()
Out[180]:
open_rate click_through_rate response_rate open_rate_ratio click_through_rate_ratio response_rate_ratio
count 23820.000000 2.382000e+04 2.382000e+04 2.382000e+04 1.411100e+04 2.377600e+04
mean -0.962155 -inf -inf inf NaN NaN
std 1.184260 NaN NaN NaN NaN NaN
min -4.290459 -inf -inf -4.290459e+00 -inf -inf
25% -1.824549 2.708050e+00 1.082514e+00 -1.686399e+00 2.335375e+00 -9.692826e-02
50% -0.342945 3.713572e+00 2.819948e+00 6.931472e-01 3.555348e+00 -7.635177e-02
75% 0.000000 4.094345e+00 3.713572e+00 NaN NaN -2.853937e-02
max 0.693147 1.005410e+01 9.879758e+00 inf inf inf
In [181]:
# Replace inf and NaN values in selected columns
cols = ['open_rate_ratio', 'click_through_rate_ratio', 'response_rate_ratio','click_through_rate','open_rate','response_rate']
replace_value = 0  # value substituted for both inf and NaN cells
train[cols] = train[cols].replace([np.inf, -np.inf], replace_value)
train[cols] = train[cols].fillna(replace_value)
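The inf and NaN values come from taking the log of zero or dividing by zero. A common alternative, shown here as a sketch on toy columns (`opened` and `emails` are hypothetical names), is additive smoothing before the log, which makes the post-hoc replacement step unnecessary:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'opened': [0, 5, 10], 'emails': [4, 5, 0]})

# Adding 1 to numerator and denominator keeps zero counts finite
df['open_rate_smoothed'] = np.log((df['opened'] + 1) / (df['emails'] + 1))
```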
In [182]:
train=train.drop(['Totalmails-notoponed','Totalmails-oponed', 'timesclicked-yes', 'timesclicked-no'],axis=1)
In [183]:
#lets do modelling
In [184]:
train = train.rename(columns={'totalorders': 'sumoforderquantity'})

7)a)Key points from visualizations without target variable.¶

1) Monthly orders, yearly orders, monthly mails and yearly mails were all highest during 2013-2014.

2) Orders exceeded 200,000 in January, September, October, November and December, and were below 200,000 in the other months.

3) City 1 has far fewer customers than cities 2, 3 and 4; from this we can draw a few tentative conclusions, which may or may not hold.

4) City 2 has the highest number of customers with medium retention potential, so we should not lose these customers, as they are valuable.

5) City 4 has 7,359 customers with low retention potential. Their risk of churning is much higher than in other cities, so a reassessment should take place to prevent churn.

6) The average order value in city 1 is higher than in city 3 even though city 1 has fewer customers, and city 3 has a lower order quantity during the second quarter of the year.

7) Until 2013, customer registrations grew exponentially, but after 2013 registrations dropped suddenly, increased again in 2016, and then dropped once more. We need to find out why registrations fell in 2013. The transaction and mail visualizations also peaked in 2013-2014 and declined afterwards, and the drop in registrations may be one of the causes.

8) Among customers who received only one mail (a welcome mail), 880 never opened it, which suggests they were not interested.

9) Customers with a long-term history are less likely to prefer online communication.

10) The moderate-history and long-term-history groups contain few customers with medium retention potential. This is useful for our analysis because our focus is on medium-retention customers: they have been customers for a long time, so if we plan targeted actions for them, they are likely to remain loyal and improve the business.

b)Key points from visualizations with target variable.¶

1) Online communication: the medium retention group has a smaller percentage of customers who prefer online communication.

2) Automatic refill: no customer in the high retention group opted for automatic refill.

3) Doorstep delivery: preference is about the same across all target groups; most customers want doorstep delivery.

4) Mail opened: the low retention group has a comparatively higher percentage of customers who did not open the mail.

5) Mail clicked: the medium retention group clicked the mail more than the others.

6) days_since_first_transaction_category: a) Only transaction: customers with a high chance of retention dominate. b) Moderate history: customers with a medium chance of retention dominate. c) Long-term history: customers with a medium chance of retention dominate. d) Recent transaction: customers are nearly equal across all retention categories.

7) email_duration_category: a) Only email: the high retention group has the most customers with only one email sent. b) Long mail history: the low retention group has the most such customers. c) Very long mail history: the low retention group has the most such customers. d) Medium history: roughly the same across all groups.

8) Customer segment: among medium-retention customers, large-order customers outnumber small-order customers.

9) Comparing order quantities by registration year in two-year steps, 2010 had the highest order quantity for the medium and high retention groups. Interestingly, the customers most likely to churn had their highest order quantity rate in 2018.

10) There are 355 medium-retention customers with a long-term transaction history who received only one mail. Medium-retention customers with a long-term transaction history are comparatively numerous, so we can target them to prevent churning.

11) 740 customers preferred neither doorstep delivery nor automatic refill; there may be some underlying factors here, so the company should look into improving these services.

In [185]:
# split into input (X) and output (y) variables
X=train.drop(['CustomerID','RetentionPotential','days_since_first_transaction_category', 'email_duration_category'], axis=1)
y=train["RetentionPotential"]
In [186]:
#label encoding of target variable
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y=le.fit_transform(y)
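LabelEncoder assigns integer codes in alphabetical order of the class labels, which matters when reading the 0/1/2 rows in the classification reports below. A quick sketch using hypothetical label strings (the actual RetentionPotential values may differ):

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform(['Medium', 'High', 'Low', 'Medium'])

# classes_ is sorted alphabetically, so High -> 0, Low -> 1, Medium -> 2
print(list(le_demo.classes_))
print(list(le_demo.inverse_transform(codes)))  # round-trips to the originals
```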
In [187]:
cat_cols=train.select_dtypes(include=['category'])
cat_cols=cat_cols.drop(["CustomerID","RetentionPotential",'days_since_first_transaction_category', 'email_duration_category'],axis=1)
cat_cols=cat_cols.columns
In [188]:
num_cols=train.select_dtypes(include=['int64', 'float64'])
num_cols=num_cols.drop(['open_rate','click_through_rate', 'response_rate', 'open_rate_ratio','click_through_rate_ratio', 'response_rate_ratio'],axis=1)
num_cols=num_cols.columns
In [189]:
 ## Convert Categorical Columns to Dummies
X = pd.get_dummies(X, columns=cat_cols,drop_first=True)
#scaling 
scaler = StandardScaler()
# scale on train
X[num_cols] =scaler.fit_transform(X[num_cols])
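Note that fitting the scaler before the train/test split lets the test rows influence the scaling statistics. A leakage-free alternative, sketched here on toy columns (`amount` and `city` are hypothetical names), fits a ColumnTransformer on the training rows only:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame; the real data has many more columns
df = pd.DataFrame({'amount': [10.0, 20.0, 30.0, 40.0],
                   'city': ['a', 'b', 'a', 'b']})

pre = ColumnTransformer([
    ('num', StandardScaler(), ['amount']),          # scale numeric columns
    ('cat', OneHotEncoder(drop='first'), ['city']),  # dummy-encode categoricals
])

X_tr, X_te = df.iloc[:2], df.iloc[2:]
Z_tr = pre.fit_transform(X_tr)   # scaling statistics learned from train only
Z_te = pre.transform(X_te)       # reused, not refit, on the held-out rows
```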
In [190]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, stratify = y,test_size=0.25,random_state=107,shuffle=True)
print('X_Train Size:', (X_train.shape))
print('X_test Size:', (X_test.shape))
print('Y_train Size:', (y_train.shape))
print('Y_test Size:', (y_test.shape))
X_Train Size: (17865, 97)
X_test Size: (5955, 97)
Y_train Size: (17865,)
Y_test Size: (5955,)
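`stratify=y` keeps the class proportions equal across the two splits, which matters for this imbalanced target. A quick sketch verifying that on toy labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 80 + [1] * 20)        # 80/20 class imbalance
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          test_size=0.25, random_state=0)

# Both splits preserve the 80/20 class ratio
print(np.bincount(y_tr) / len(y_tr), np.bincount(y_te) / len(y_te))
```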

8)Defining error metrics¶

In [191]:
# Function for Classification Report

def classifcation_report_train_test(y_train, y_train_pred, y_test, y_test_pred):

    print('''
            *****************************************
               CLASSIFICATION REPORT FOR TRAIN DATA
            *****************************************
            ''')
    print(classification_report(y_train, y_train_pred, digits=4))

    print('''
            *****************************************
               CLASSIFICATION REPORT FOR TEST DATA
            *****************************************
            ''')
    print(classification_report(y_test, y_test_pred, digits=4))
In [192]:
#defining error metrics
scores = pd.DataFrame(columns=['Model','Train_Accuracy','Train_Recall','Test_Accuracy','Test_Recall'])

def get_metrics(train_actual,train_predicted,test_actual,test_predicted,model_description,dataframe):
    train_accuracy = accuracy_score(train_actual,train_predicted)
    train_recall   = recall_score(train_actual,train_predicted,average="macro")
    test_accuracy = accuracy_score(test_actual,test_predicted)
    test_recall   = recall_score(test_actual,test_predicted,average="macro")
    # DataFrame.append was removed in pandas 2.0; concatenate a one-row frame instead
    row = pd.DataFrame([[model_description, train_accuracy, train_recall,
                         test_accuracy, test_recall]], columns=scores.columns)
    dataframe = pd.concat([dataframe, row], ignore_index=True)
    return(dataframe)
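The macro-averaged recall used in `get_metrics` weights each class equally regardless of support, which suits this imbalanced target. A toy check of how it differs from accuracy:

```python
from sklearn.metrics import accuracy_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]   # heavily imbalanced labels
y_pred = [0] * 10                          # predicts the majority class only

acc = accuracy_score(y_true, y_pred)                   # 0.8 -- looks fine
macro = recall_score(y_true, y_pred, average='macro')  # (1 + 0 + 0) / 3
```

Accuracy rewards the degenerate majority-class predictor, while macro recall exposes that both minority classes are missed entirely.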

9)Model building¶

We will experiment with different models until we find one that generalizes well:

1) Naive Bayes classifier

2) Decision tree classifier

3) Random forest classifier

4) Support vector machine classifier

5) XGBoost

6) Boosting techniques

7) Neural networks

8) KNN classifier

We will choose the best model among these and then study the learning behaviour of each.

a)Models¶

1)Naive Bayes classifier¶

In [194]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import recall_score
from sklearn import svm
NB1 =  GaussianNB()
NB1=NB1.fit(X_train, y_train)
y_pred_train =NB1.predict(X_train)
y_pred_train[0:10]
# Make predictions on the testing set and evaluate accuracy
y_pred_test = NB1.predict(X_test)
y_pred_test[0:10]
print("trainaccuracy",accuracy_score(y_train,y_pred_train))
print("testaccuracy",accuracy_score(y_test,y_pred_test))
print("recall-train",recall_score(y_train,y_pred_train,average='macro'))
print("recall-test",recall_score(y_test,y_pred_test,average='macro'))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
scores = get_metrics(y_train,y_pred_train,y_test,y_pred_test,"GaussianNB",scores)
scores
trainaccuracy 0.6920794850265883
testaccuracy 0.6886649874055416
recall-train 0.7184312500115704
recall-test 0.7101471229574522

            *****************************************
               CLASSIFICATION REPORT FOR TRAIN DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.3585    0.8931    0.5116      2817
           1     0.9657    0.6567    0.7817     14389
           2     0.3761    0.6055    0.4640       659

    accuracy                         0.6921     17865
   macro avg     0.5667    0.7184    0.5858     17865
weighted avg     0.8482    0.6921    0.7274     17865


            *****************************************
               CLASSIFICATION REPORT FOR TEST DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.3561    0.9010    0.5104       939
           1     0.9628    0.6522    0.7776      4796
           2     0.3848    0.5773    0.4618       220

    accuracy                         0.6887      5955
   macro avg     0.5679    0.7101    0.5833      5955
weighted avg     0.8457    0.6887    0.7238      5955

Out[194]:
Model Train_Accuracy Train_Recall Test_Accuracy Test_Recall
0 GaussianNB 0.692079 0.718431 0.688665 0.710147
In [195]:
from sklearn.model_selection import learning_curve
from sklearn.model_selection import StratifiedKFold
# Define the stratified k-fold cross-validator
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_sizes, train_scores, test_scores = learning_curve(NB1, X_train, y_train, cv=skf, scoring='accuracy', n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 10))

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.figure(figsize=(8,6))
plt.grid()
plt.title("Learning Curve")
plt.xlabel("Training Examples")
plt.ylabel("Score")
plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
plt.legend(loc="best")
plt.show()

From the learning curve, the Naive Bayes classifier appears to underfit. One possible reason is limited training data, but the classification report also shows weak scores for the minority target labels. Since we are not getting the desired performance here, we will check other models.

2)Decision tree classifier¶

In [196]:
clf_dt = DecisionTreeClassifier()
clf_dt.fit(X=X_train, y= y_train)

y_pred_train = clf_dt.predict(X_train)
y_pred_test = clf_dt.predict(X_test)

print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))

print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
scores = get_metrics(y_train,y_pred_train,y_test,y_pred_test,"Decisiontree",scores)
scores
Train Accuracy: 0.9876294430450602
Test Accuracy: 0.8777497900923593

            *****************************************
               CLASSIFICATION REPORT FOR TRAIN DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.9493    0.9769    0.9629      2817
           1     0.9949    0.9897    0.9923     14389
           2     0.9985    0.9879    0.9931       659

    accuracy                         0.9876     17865
   macro avg     0.9809    0.9848    0.9828     17865
weighted avg     0.9878    0.9876    0.9877     17865


            *****************************************
               CLASSIFICATION REPORT FOR TEST DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.6302    0.6880    0.6578       939
           1     0.9326    0.9206    0.9265      4796
           2     0.8469    0.7545    0.7981       220

    accuracy                         0.8777      5955
   macro avg     0.8033    0.7877    0.7942      5955
weighted avg     0.8818    0.8777    0.8794      5955

Out[196]:
Model Train_Accuracy Train_Recall Test_Accuracy Test_Recall
0 GaussianNB 0.692079 0.718431 0.688665 0.710147
1 Decisiontree 0.987629 0.984834 0.877750 0.787690
In [197]:
from sklearn.model_selection import learning_curve

train_sizes, train_scores, test_scores = learning_curve(clf_dt, X_train, y_train, cv=skf, scoring='accuracy', n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 10))

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.figure(figsize=(8,6))
plt.grid()
plt.title("Learning Curve")
plt.xlabel("Training Examples")
plt.ylabel("Score")
plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
plt.legend(loc="best")
plt.show()

From the learning curve we can see that the model is overfitting: there is a large gap between the training and cross-validation scores. A good model should generalize well to instances that were not part of the training data. To address this we can either add more data or tune the hyperparameters for better results.
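One way to rein in an overfitting tree is to cross-validate depth and leaf-size constraints. A sketch on synthetic data (since the real `X_train`/`y_train` are not reproduced here), using the same macro-recall objective tracked in the scores table:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     n_informative=5, random_state=0)

param_grid = {
    'max_depth': [3, 5, 8, None],      # cap the tree's depth
    'min_samples_leaf': [1, 5, 20],    # require minimum support per leaf
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring='recall_macro')
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Shallower depths and larger leaves trade a little training accuracy for a smaller train/validation gap.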

3)Random forest classifier¶

In [198]:
# Import Randomforest Library

from sklearn.ensemble import RandomForestClassifier
clf_rf = RandomForestClassifier()
clf_rf.fit(X=X_train, y= y_train)

y_pred_train = clf_rf.predict(X_train)
y_pred_test = clf_rf.predict(X_test)

print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))

print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
scores = get_metrics(y_train,y_pred_train,y_test,y_pred_test,"randomforest",scores)
scores
Train Accuracy: 0.9875734676742234
Test Accuracy: 0.9079764903442485

            *****************************************
               CLASSIFICATION REPORT FOR TRAIN DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.9780    0.9457    0.9616      2817
           1     0.9889    0.9958    0.9923     14389
           2     0.9985    0.9879    0.9931       659

    accuracy                         0.9876     17865
   macro avg     0.9884    0.9764    0.9823     17865
weighted avg     0.9875    0.9876    0.9875     17865


            *****************************************
               CLASSIFICATION REPORT FOR TEST DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.8099    0.6124    0.6974       939
           1     0.9203    0.9750    0.9468      4796
           2     0.9512    0.7091    0.8125       220

    accuracy                         0.9080      5955
   macro avg     0.8938    0.7655    0.8189      5955
weighted avg     0.9040    0.9080    0.9025      5955

Out[198]:
Model Train_Accuracy Train_Recall Test_Accuracy Test_Recall
0 GaussianNB 0.692079 0.718431 0.688665 0.710147
1 Decisiontree 0.987629 0.984834 0.877750 0.787690
2 randomforest 0.987573 0.976436 0.907976 0.765475

This is also an example of overfitting, as there is a large gap between the training and test scores.

In [199]:
from sklearn.model_selection import learning_curve

train_sizes, train_scores, test_scores = learning_curve(clf_rf, X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 10))

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.figure(figsize=(8,6))
plt.grid()
plt.title("Learning Curve")
plt.xlabel("Training Examples")
plt.ylabel("Score")
plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
plt.legend(loc="best")
plt.show()

The learning curve again shows overfitting, with a large gap between the training and cross-validation scores; a model should generalize to instances it was not trained on, so adding more data or tuning the hyperparameters could help.

In [200]:
importances = clf_rf.feature_importances_
importances
np.argsort(importances)[::-1]
indices = np.argsort(importances)[::-1]
ind_attr_names = X_train.columns
pd.DataFrame([ind_attr_names[indices], np.sort(importances)[::-1]])
Out[200]:
Feature importances (top 10 of 97 features; each of the remaining one-hot calendar/city features contributes less than 0.028):

recency                          0.080549
Time_since_Last_Email            0.074629
numofemails                      0.072010
response_rate                    0.064250
Time_to_First_Email              0.048779
open_rate                        0.046031
duration_of_customer_relation    0.040313
click_through_rate               0.039392
sumoforderquantity               0.038414
response_rate_ratio              0.037904

4)SVMCLASSIFIER¶

In [201]:
from sklearn.svm import SVC 
clf_svc = SVC()
clf_svc.fit(X=X_train, y= y_train)

y_pred_train = clf_svc.predict(X_train)
y_pred_test = clf_svc.predict(X_test)

print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))

print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
scores = get_metrics(y_train,y_pred_train,y_test,y_pred_test,"SVMCLASSIFIER",scores)
scores
Train Accuracy: 0.8963336132101876
Test Accuracy: 0.8960537363560034

            *****************************************
               CLASSIFICATION REPORT FOR TRAIN DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.7425    0.5815    0.6522      2817
           1     0.9171    0.9681    0.9419     14389
           2     0.9488    0.6753    0.7890       659

    accuracy                         0.8963     17865
   macro avg     0.8695    0.7416    0.7944     17865
weighted avg     0.8907    0.8963    0.8906     17865


            *****************************************
               CLASSIFICATION REPORT FOR TEST DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.7552    0.5783    0.6550       939
           1     0.9146    0.9693    0.9412      4796
           2     0.9412    0.6545    0.7721       220

    accuracy                         0.8961      5955
   macro avg     0.8703    0.7341    0.7894      5955
weighted avg     0.8905    0.8961    0.8898      5955

Out[201]:
Model Train_Accuracy Train_Recall Test_Accuracy Test_Recall
0 GaussianNB 0.692079 0.718431 0.688665 0.710147
1 Decisiontree 0.987629 0.984834 0.877750 0.787690
2 randomforest 0.987573 0.976436 0.907976 0.765475
3 SVMCLASSIFIER 0.896334 0.741612 0.896054 0.734057
In [384]:
from sklearn.model_selection import learning_curve

train_sizes, train_scores, test_scores = learning_curve(clf_svc, X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 10))

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.figure(figsize=(8,6))
plt.grid()
plt.title("Learning Curve")
plt.xlabel("Training Examples")
plt.ylabel("Score")
plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
plt.legend(loc="best")
plt.show()

This behaves like a well-generalized model; we will analyze it further with learning curves during hyperparameter tuning. The problem with SVM is that it takes a long time to train, so we will first look for alternative models and fall back on the SVM classifier only if we do not find any.
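One possible time-saver worth trying before committing to the kernelized SVC is a linear SVM. A minimal sketch on synthetic stand-in data (our real X_train/y_train would be substituted; the data and accuracy here are illustrative, not from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic 3-class stand-in for our X_train/y_train (hypothetical data).
X, y = make_classification(n_samples=2000, n_features=40, n_informative=10,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# LinearSVC skips the kernel computation that makes SVC slow on large, wide
# feature matrices; dual=False is the usual choice when n_samples > n_features.
clf_lsvc = LinearSVC(dual=False, max_iter=5000, random_state=42)
clf_lsvc.fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf_lsvc.predict(X_te))
print("LinearSVC test accuracy:", round(acc, 4))
```

The trade-off is that LinearSVC can only learn a linear decision boundary, so it should be kept only if its scores stay close to the kernelized SVC's.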

5)XGBOOST CLASSIFIER¶

In [202]:
# Import XGBClassifier Library

from xgboost import XGBClassifier
clf_XGB = XGBClassifier()
clf_XGB.fit(X=X_train, y= y_train)

y_pred_train = clf_XGB.predict(X_train)
y_pred_test = clf_XGB.predict(X_test)

print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))

print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
scores = get_metrics(y_train,y_pred_train,y_test,y_pred_test,"XGBOOST",scores)
scores
Train Accuracy: 0.9540442205429611
Test Accuracy: 0.9182199832073887

            *****************************************
               CLASSIFICATION REPORT FOR TRAIN DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.9415    0.7654    0.8443      2817
           1     0.9540    0.9907    0.9720     14389
           2     1.0000    0.9605    0.9799       659

    accuracy                         0.9540     17865
   macro avg     0.9652    0.9055    0.9321     17865
weighted avg     0.9537    0.9540    0.9522     17865


            *****************************************
               CLASSIFICATION REPORT FOR TEST DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.8375    0.6475    0.7303       939
           1     0.9286    0.9756    0.9515      4796
           2     0.9526    0.8227    0.8829       220

    accuracy                         0.9182      5955
   macro avg     0.9062    0.8153    0.8549      5955
weighted avg     0.9151    0.9182    0.9141      5955

Out[202]:
Model Train_Accuracy Train_Recall Test_Accuracy Test_Recall
0 GaussianNB 0.692079 0.718431 0.688665 0.710147
1 Decisiontree 0.987629 0.984834 0.877750 0.787690
2 randomforest 0.987573 0.976436 0.907976 0.765475
3 SVMCLASSIFIER 0.896334 0.741612 0.896054 0.734057
4 XGBOOST 0.954044 0.905529 0.918220 0.815276

The XGBoost classifier is overfitting, which means it may not predict as well on test data or unseen instances. We will investigate further before deciding whether to use it.

ADABOOST CLASSIFIER

In [203]:
clf_ada = AdaBoostClassifier(base_estimator=GaussianNB(),random_state=123)
clf_ada.fit(X_train, y_train)


y_pred_train = clf_ada.predict(X_train)
y_pred_test = clf_ada.predict(X_test)

print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))

print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
scores = get_metrics(y_train,y_pred_train,y_test,y_pred_test,"AdaBoostClassifier",scores)
scores
Train Accuracy: 0.7385950181919955
Test Accuracy: 0.7390428211586901

            *****************************************
               CLASSIFICATION REPORT FOR TRAIN DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.3239    0.3039    0.3136      2817
           1     0.8428    0.8456    0.8442     14389
           2     0.2188    0.2610    0.2381       659

    accuracy                         0.7386     17865
   macro avg     0.4618    0.4701    0.4653     17865
weighted avg     0.7380    0.7386    0.7382     17865


            *****************************************
               CLASSIFICATION REPORT FOR TEST DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.2953    0.2673    0.2806       939
           1     0.8395    0.8543    0.8468      4796
           2     0.2356    0.2409    0.2382       220

    accuracy                         0.7390      5955
   macro avg     0.4568    0.4542    0.4552      5955
weighted avg     0.7314    0.7390    0.7351      5955

Out[203]:
Model Train_Accuracy Train_Recall Test_Accuracy Test_Recall
0 GaussianNB 0.692079 0.718431 0.688665 0.710147
1 Decisiontree 0.987629 0.984834 0.877750 0.787690
2 randomforest 0.987573 0.976436 0.907976 0.765475
3 SVMCLASSIFIER 0.896334 0.741612 0.896054 0.734057
4 XGBOOST 0.954044 0.905529 0.918220 0.815276
5 AdaBoostClassifier 0.738595 0.470149 0.739043 0.454156

An underfitting model, with low recall scores for target class '2' in both the train and validation sets.
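One option worth trying before discarding AdaBoost is a more expressive base learner than GaussianNB, such as shallow class-weighted decision trees. A sketch on synthetic imbalanced data (the rare third class mimics our target '2'); the estimator is passed positionally because the keyword was renamed from base_estimator to estimator in newer scikit-learn releases:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import recall_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical imbalanced 3-class data; class 2 is rare, like in our target.
X, y = make_classification(n_samples=2000, n_features=40, n_informative=10,
                           n_classes=3, weights=[0.5, 0.4, 0.1],
                           random_state=42)

# Depth-2 balanced trees give the ensemble more capacity than GaussianNB
# and let the class weights push up recall on the rare class.
clf_ada2 = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=2, class_weight='balanced'),
    n_estimators=100, random_state=123)
clf_ada2.fit(X, y)
rec = recall_score(y, clf_ada2.predict(X), average=None)
print("Per-class train recall:", rec.round(3))
```

Whether this beats the GaussianNB base learner on our data would still need to be verified with the same train/test evaluation as above.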


6)GRADIENTBOOSTINGCLASSIFIER¶

In [204]:
clf_GBM = GradientBoostingClassifier()
clf_GBM.fit(X=X_train, y=y_train)

y_pred_train = clf_GBM.predict(X_train)
y_pred_test = clf_GBM.predict(X_test)

print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))

print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
scores = get_metrics(y_train,y_pred_train,y_test,y_pred_test,"GBM",scores)
Train Accuracy: 0.9196193674783095
Test Accuracy: 0.911167086481948

            *****************************************
               CLASSIFICATION REPORT FOR TRAIN DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.8710    0.6134    0.7199      2817
           1     0.9241    0.9836    0.9529     14389
           2     0.9682    0.8316    0.8947       659

    accuracy                         0.9196     17865
   macro avg     0.9211    0.8095    0.8558     17865
weighted avg     0.9174    0.9196    0.9140     17865


            *****************************************
               CLASSIFICATION REPORT FOR TEST DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.8480    0.5942    0.6988       939
           1     0.9185    0.9798    0.9481      4796
           2     0.9337    0.7682    0.8429       220

    accuracy                         0.9112      5955
   macro avg     0.9001    0.7807    0.8299      5955
weighted avg     0.9079    0.9112    0.9049      5955

In [205]:
from sklearn.model_selection import learning_curve

train_sizes, train_scores, test_scores = learning_curve(clf_GBM, X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 10))

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.figure(figsize=(8,6))
plt.grid()
plt.title("Learning Curve")
plt.xlabel("Training Examples")
plt.ylabel("Score")
plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
plt.legend(loc="best")
plt.show()

If the gap between the training and validation performance in the learning curve is small, it generally indicates that the model is generalizing well and is not overfitting on the training data. This means that the model is learning patterns that are useful for making predictions on new, unseen data.
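That visual judgement can be made concrete with a small helper that reports the final train/CV gap; the arrays below are illustrative stand-ins for the learning_curve means computed above:

```python
import numpy as np

def final_gap(train_scores_mean, test_scores_mean):
    """Difference between the training and cross-validation score at the
    largest training-set size; a small value suggests good generalization."""
    return float(train_scores_mean[-1] - test_scores_mean[-1])

# Illustrative arrays standing in for the learning_curve means above.
train_mean = np.array([0.95, 0.93, 0.92, 0.92])
cv_mean = np.array([0.85, 0.89, 0.90, 0.91])
print("final gap:", round(final_gap(train_mean, cv_mean), 3))
```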

7)KNNCLASSIFIER¶

In [206]:
from sklearn.neighbors import KNeighborsClassifier

# Create a KNN classifier with k=5
clf_knn = KNeighborsClassifier(n_neighbors=5)

# Fit the classifier on the training data
clf_knn.fit(X_train, y_train)

# Make predictions on the training and test data
y_pred_train = clf_knn.predict(X_train)
y_pred_test = clf_knn.predict(X_test)

# Print the accuracy scores for the training and test data
print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
print("Test Accuracy:",accuracy_score(y_test,y_pred_test))

# Print the classification report for the training and test data
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)

# Get the metrics for the classifier and add them to the scores dictionary
scores = get_metrics(y_train,y_pred_train,y_test,y_pred_test,"KNN",scores)
Train Accuracy: 0.9014273719563393
Test Accuracy: 0.8723761544920235

            *****************************************
               CLASSIFICATION REPORT FOR TRAIN DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.7932    0.5815    0.6710      2817
           1     0.9146    0.9783    0.9454     14389
           2     0.9511    0.5903    0.7285       659

    accuracy                         0.9014     17865
   macro avg     0.8863    0.7167    0.7816     17865
weighted avg     0.8968    0.9014    0.8941     17865


            *****************************************
               CLASSIFICATION REPORT FOR TEST DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.6958    0.4579    0.5523       939
           1     0.8919    0.9683    0.9285      4796
           2     0.9308    0.5500    0.6914       220

    accuracy                         0.8724      5955
   macro avg     0.8395    0.6587    0.7241      5955
weighted avg     0.8624    0.8724    0.8604      5955

The model has a higher accuracy on the training set (90.14%) than on the test set (87.24%), which suggests it may be overfitting to the training data. The recall values for the first and third classes are relatively low, especially on the test set, which suggests the model struggles to correctly identify instances of these classes.
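Overfitting in KNN is usually addressed by increasing k or weighting neighbours by distance. A hedged sketch of choosing k by cross-validation on synthetic stand-in data (the candidate k values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 3-class data standing in for our feature matrix.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           n_classes=3, random_state=42)

# Larger k smooths the decision boundary (less overfitting); weights='distance'
# lets closer neighbours count for more. Pick k by cross-validated accuracy.
results = {}
for k in (3, 5, 11, 21):
    knn = KNeighborsClassifier(n_neighbors=k, weights='distance')
    results[k] = cross_val_score(knn, X, y, cv=5, scoring='accuracy').mean()

best_k = max(results, key=results.get)
print("CV accuracy by k:", {k: round(v, 3) for k, v in results.items()})
print("best k:", best_k)
```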

8)ANN-CLASSIFIER¶

In [207]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Define the model architecture
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.2),
    Dense(32, activation='relu'),
    Dropout(0.2),
    Dense(3, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))
# Get the predicted class probabilities for the train and test data
y_pred_train_prob = model.predict(X_train)
y_pred_test_prob = model.predict(X_test)

# Get the predicted classes by selecting the class with the highest probability
y_pred_train = np.argmax(y_pred_train_prob, axis=1)
y_pred_test = np.argmax(y_pred_test_prob, axis=1)
# Print the accuracy scores for the training and test data
print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
print("Test Accuracy:",accuracy_score(y_test,y_pred_test))

# Print the classification report for the training and test data
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
# Get the metrics for the classifier and add them to the scores dictionary
scores = get_metrics(y_train,y_pred_train,y_test,y_pred_test,"ANN-CLASSIFICATION",scores)
Epoch 1/10
559/559 [==============================] - 4s 5ms/step - loss: 0.4211 - accuracy: 0.8355 - val_loss: 0.3055 - val_accuracy: 0.8806
Epoch 2/10
559/559 [==============================] - 2s 4ms/step - loss: 0.3163 - accuracy: 0.8704 - val_loss: 0.2765 - val_accuracy: 0.8877
Epoch 3/10
559/559 [==============================] - 2s 4ms/step - loss: 0.2915 - accuracy: 0.8818 - val_loss: 0.2621 - val_accuracy: 0.8969
Epoch 4/10
559/559 [==============================] - 4s 7ms/step - loss: 0.2736 - accuracy: 0.8897 - val_loss: 0.2537 - val_accuracy: 0.8969
Epoch 5/10
559/559 [==============================] - 3s 4ms/step - loss: 0.2611 - accuracy: 0.8919 - val_loss: 0.2513 - val_accuracy: 0.8959
Epoch 6/10
559/559 [==============================] - 2s 3ms/step - loss: 0.2552 - accuracy: 0.8943 - val_loss: 0.2549 - val_accuracy: 0.8954
Epoch 7/10
559/559 [==============================] - 1s 3ms/step - loss: 0.2498 - accuracy: 0.8966 - val_loss: 0.2439 - val_accuracy: 0.9003
Epoch 8/10
559/559 [==============================] - 1s 3ms/step - loss: 0.2424 - accuracy: 0.8983 - val_loss: 0.2416 - val_accuracy: 0.8992
Epoch 9/10
559/559 [==============================] - 1s 2ms/step - loss: 0.2395 - accuracy: 0.9004 - val_loss: 0.2402 - val_accuracy: 0.8999
Epoch 10/10
559/559 [==============================] - 1s 3ms/step - loss: 0.2366 - accuracy: 0.9020 - val_loss: 0.2397 - val_accuracy: 0.9034
559/559 [==============================] - 1s 1ms/step
187/187 [==============================] - 0s 1ms/step
Train Accuracy: 0.9127343968653793
Test Accuracy: 0.9034424853064652

            *****************************************
               CLASSIFICATION REPORT FOR TRAIN DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.8639    0.5747    0.6903      2817
           1     0.9190    0.9850    0.9509     14389
           2     0.9033    0.7800    0.8371       659

    accuracy                         0.9127     17865
   macro avg     0.8954    0.7799    0.8261     17865
weighted avg     0.9097    0.9127    0.9056     17865


            *****************************************
               CLASSIFICATION REPORT FOR TEST DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.8290    0.5421    0.6555       939
           1     0.9129    0.9831    0.9467      4796
           2     0.8864    0.7091    0.7879       220

    accuracy                         0.9034      5955
   macro avg     0.8761    0.7448    0.7967      5955
weighted avg     0.8987    0.9034    0.8949      5955

In [208]:
# Plot the training and validation accuracy values
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

# Plot the training and validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

Both training and test accuracies have increased, and the gap between them has also decreased, which is a good sign. Additionally, looking at the classification report, we can see that the precision, recall, and F1-scores have also improved for all three classes in the test data.

In [209]:
scores
Out[209]:
Model Train_Accuracy Train_Recall Test_Accuracy Test_Recall
0 GaussianNB 0.692079 0.718431 0.688665 0.710147
1 Decisiontree 0.987629 0.984834 0.877750 0.787690
2 randomforest 0.987573 0.976436 0.907976 0.765475
3 SVMCLASSIFIER 0.896334 0.741612 0.896054 0.734057
4 XGBOOST 0.954044 0.905529 0.918220 0.815276
5 AdaBoostClassifier 0.738595 0.470149 0.739043 0.454156
6 GBM 0.919619 0.809527 0.911167 0.780735
7 KNN 0.901427 0.716692 0.872376 0.658747
8 ANN-CLASSIFICATION 0.912734 0.779894 0.903442 0.744756
In [210]:
scores.to_csv('basemodels',index=False)

b)Choosing best models¶

1)ANN-CLASSIFIER has been the best model by far: it generalizes well, its overall recall score is good, and there is scope for improvement, which we will pursue through hyperparameter tuning.

2)DECISIONTREE, RANDOMFOREST, XGBOOST, KNN:- These models are overfitting, so they may not perform as well as the others when unseen data is fed into them. We will do hyperparameter tuning for all of them and then decide which model to select.

3)Naive Bayes classifier, AdaBoost classifier:- These models are underfitting; their performance might improve if there were more examples in the training set.

4)GBM:- This model is learning patterns that should be useful for predicting unseen data; as we can observe from the learning curve, it is a generalized model.

5)SVM CLASSIFIER:- This model also learns useful patterns and generalizes well, as we can observe from the learning curves. Its only problem is that it is computationally expensive.

For the next round of scoring we will choose the ANN classifier, GBM, SVM classifier, random forest, and XGBoost: we will build models with different hyperparameters and check which one performs best.


c)Validation and parameter Tuning¶

In [230]:
scores1 = pd.DataFrame(columns=['Model','Train_Accuracy','Train_Recall','Train_Precision','Train_F1_Score','Test_Accuracy','Test_Recall','Test_Precision','Test_F1_Score'])

def get_metrics(train_actual,train_predicted,test_actual,test_predicted,model_description,dataframe):
   
    train_accuracy = accuracy_score(train_actual,train_predicted)
    train_recall   = recall_score(train_actual,train_predicted,average="macro")
    train_precision= precision_score(train_actual,train_predicted,average="macro")
    train_f1score  = f1_score(train_actual,train_predicted,average="macro")
    test_accuracy = accuracy_score(test_actual,test_predicted)
    test_recall   = recall_score(test_actual,test_predicted,average="macro")
    test_precision= precision_score(test_actual,test_predicted,average="macro")
    test_f1score  = f1_score(test_actual,test_predicted,average="macro")
    dataframe = dataframe.append(pd.Series([model_description, train_accuracy,train_recall,train_precision,train_f1score,
                                            test_accuracy,test_recall,test_precision,test_f1score],
                                           index=scores1.columns ), ignore_index=True)
    return(dataframe)

For validation we will use the stratified k-fold technique. Stratified k-fold cross-validation is a variation of k-fold cross-validation that aims to ensure each fold is representative of the overall distribution of the target variable. This is particularly useful when the dataset is imbalanced, i.e. when some classes have significantly fewer samples than others.
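A quick illustration of why stratification matters: on a synthetic imbalanced target shaped roughly like ours, every StratifiedKFold test fold keeps the same class proportions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced 3-class target (class 2 is rare, like ours).
y = np.array([0] * 150 + [1] * 800 + [2] * 50)
X = np.zeros((len(y), 1))  # features are irrelevant to the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_counts = [np.bincount(y[test_idx], minlength=3)
               for _, test_idx in skf.split(X, y)]
for i, counts in enumerate(fold_counts):
    print(f"fold {i}: class counts {counts}")  # 30/160/10 in every fold
```

A plain (unstratified) KFold on the same target could leave some folds with almost no class-2 samples at all.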

For parameter tuning we will use randomized search CV, since grid search CV is time-consuming and computationally expensive, especially when there are many hyperparameters.
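To see the saving concretely: the decision-tree grid defined in the next cell has 2 x 2 x 3 x 3 x 4 = 144 candidates, so an exhaustive grid search over our 5 folds would mean 720 fits, versus 50 for randomized search with the default n_iter=10. A small sanity check of that arithmetic:

```python
from math import prod

# Options per hyperparameter in the decision-tree grid used below.
grid_sizes = {'class_weight': 2, 'criterion': 2, 'max_depth': 3,
              'min_samples_leaf': 3, 'max_leaf_nodes': 4}
n_splits = 5   # our StratifiedKFold
n_iter = 10    # RandomizedSearchCV default

grid_fits = prod(grid_sizes.values()) * n_splits      # 144 candidates x 5 folds
random_fits = n_iter * n_splits
print("GridSearchCV fits:", grid_fits)
print("RandomizedSearchCV fits:", random_fits)
```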

In [224]:
# Define the stratified k-fold cross-validator
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
from sklearn.model_selection import RandomizedSearchCV

Let's start randomized search CV with our overfit models first.

In [225]:
# After running RandomizedSearchCV we store its results as text below, to avoid the long re-run time.
#Decision Tree with RandomizedSearchCV - Hyper-parameter Tuning
# set of parameters to test
param_grid = {"class_weight":['balanced', None],
              "criterion": ["gini", "entropy"],
              "max_depth": [3, 5, 6],
              "min_samples_leaf": [2, 5, 10],
               "max_leaf_nodes": [None, 5, 10, 20]
              }
dt = DecisionTreeClassifier(random_state=123)
clf_dt_grid = RandomizedSearchCV(dt, param_grid, cv=skf, scoring='recall_macro')  # macro-averaged recall; the plain 'recall' scorer is binary-only
clf_dt_grid.fit(X_train, y_train)  
print(clf_dt_grid.best_params_)

1)Decision tree with hyperparameters¶

output:-{'min_samples_leaf': 5, 'max_leaf_nodes': 20, 'max_depth': 6, 'criterion': 'gini', 'class_weight': 'balanced'}

In [226]:
clf_dt_grid = DecisionTreeClassifier(min_samples_leaf=10, max_leaf_nodes=30, max_depth=8, criterion='entropy', class_weight='balanced')
clf_dt_grid.fit(X=X_train, y= y_train)

y_pred_train = clf_dt_grid.predict(X_train)
y_pred_test = clf_dt_grid.predict(X_test)

print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))

print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
Train Accuracy: 0.8408060453400503
Test Accuracy: 0.8397984886649874

            *****************************************
               CLASSIFICATION REPORT FOR TRAIN DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.5844    0.7568    0.6596      2817
           1     0.9527    0.8527    0.9000     14389
           2     0.4626    0.9393    0.6199       659

    accuracy                         0.8408     17865
   macro avg     0.6666    0.8496    0.7265     17865
weighted avg     0.8766    0.8408    0.8517     17865


            *****************************************
               CLASSIFICATION REPORT FOR TEST DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.5774    0.7742    0.6615       939
           1     0.9568    0.8503    0.9004      4796
           2     0.4516    0.8909    0.5994       220

    accuracy                         0.8398      5955
   macro avg     0.6620    0.8385    0.7204      5955
weighted avg     0.8783    0.8398    0.8516      5955

In [231]:
scores1= get_metrics(y_train,y_pred_train,y_test,y_pred_test,"DecisionTree_BestParameters",scores1)
scores1
Out[231]:
Model Train_Accuracy Train_Recall Train_Precision Train_F1_Score Test_Accuracy Test_Recall Test_Precision Test_F1_Score
0 DecisionTree_BestParameters 0.840806 0.849623 0.666591 0.726479 0.839798 0.838476 0.661961 0.720439
In [368]:
from sklearn.model_selection import learning_curve

train_sizes, train_scores, test_scores = learning_curve(clf_dt_grid, X_train, y_train, cv=skf, scoring='accuracy', n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 10))

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.figure(figsize=(8,6))
plt.grid()
plt.title("Learning Curve")
plt.xlabel("Training Examples")
plt.ylabel("Score")
plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="y", label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="b", label="Cross-validation score")
plt.legend(loc="best")
plt.show()

2)Decisiontree with randomized search cv using different parameters.¶

#from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, truncnorm, randint

clf3_dt = DecisionTreeClassifier(random_state=123, class_weight='balanced') 
## Set up hyperparameter distributions
# normally distributed max_leaf_nodes, mean 5, stddev 1, clipped at 1
max_leaf_nodes = np.random.normal(loc=5, scale=1, size=5).astype(int)
max_leaf_nodes[max_leaf_nodes < 1] = 1
print(max_leaf_nodes)

# integers drawn uniformly from [2, 5)
max_depth = np.random.uniform(2, 5, 4).astype(int)
print(max_depth)

# integers drawn uniformly from [2, 6)
min_samples_split = np.random.uniform(2, 6, 5).astype(int)
print(min_samples_split)

model_params = {
    'max_depth': list(max_depth),
    'max_leaf_nodes': list(max_leaf_nodes),
    'min_samples_split': list(min_samples_split)
}
clf_random = RandomizedSearchCV(estimator=clf3_dt, param_distributions=model_params, n_iter=600, cv=skf, scoring='recall_macro', n_jobs=-1)  # macro recall for the 3-class target
clf_random.fit(X_train, y_train)
print(clf_random.best_score_, clf_random.best_params_)
clf_dt_random = clf_random.best_estimator_
print(clf_dt_random)

DecisionTreeClassifier(class_weight='balanced', max_depth=4, max_leaf_nodes=5, min_samples_split=3, random_state=123)

In [233]:
clf_dt_grid2 = DecisionTreeClassifier(class_weight='balanced', max_depth=3, max_leaf_nodes=5,
                       min_samples_split=3, random_state=123)
clf_dt_grid2.fit(X=X_train, y= y_train)

y_pred_train = clf_dt_grid2.predict(X_train)
y_pred_test = clf_dt_grid2.predict(X_test)

print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))

print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
Train Accuracy: 0.6828995242093479
Test Accuracy: 0.6896725440806045

            *****************************************
               CLASSIFICATION REPORT FOR TRAIN DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.3269    0.8882    0.4779      2817
           1     0.9755    0.6403    0.7731     14389
           2     0.6315    0.7360    0.6797       659

    accuracy                         0.6829     17865
   macro avg     0.6447    0.7548    0.6436     17865
weighted avg     0.8606    0.6829    0.7231     17865


            *****************************************
               CLASSIFICATION REPORT FOR TEST DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.3307    0.8829    0.4811       939
           1     0.9704    0.6499    0.7785      4796
           2     0.6822    0.7318    0.7061       220

    accuracy                         0.6897      5955
   macro avg     0.6611    0.7549    0.6552      5955
weighted avg     0.8589    0.6897    0.7289      5955

In [234]:
scores1= get_metrics(y_train,y_pred_train,y_test,y_pred_test,"DecisionTree_BestParameters-2",scores1)
scores1
Out[234]:
Model Train_Accuracy Train_Recall Train_Precision Train_F1_Score Test_Accuracy Test_Recall Test_Precision Test_F1_Score
0 DecisionTree_BestParameters 0.840806 0.849623 0.666591 0.726479 0.839798 0.838476 0.661961 0.720439
1 DecisionTree_BestParameters-2 0.682900 0.754808 0.644660 0.643605 0.689673 0.754863 0.661100 0.655250

It is not as generalized a model as our previous one, so we will not use it.

3)Random forest with hyperparameter tuning¶

param_grid = {#"n_estimators" : [50,60],
              "criterion": ["gini", "entropy"],
              "min_samples_split": [2, 5],
              "max_depth": [None, 7],
              "min_samples_leaf": [1, 5],
              "ccp_alpha":[0.0001,0.001,0.01,0.1]
             }
rfclf2 = RandomForestClassifier()
rfclf_grid = RandomizedSearchCV(rfclf2, param_grid, cv=skf)     
rfclf_grid.fit(X_train, y_train)
print(rfclf_grid.best_params_)
rfclf_grid = rfclf_grid.best_estimator_ 
print(rfclf_grid)

RandomForestClassifier(ccp_alpha=0.0001, min_samples_split=5)

In [235]:
rfclf_grid =RandomForestClassifier(ccp_alpha=0.0001, min_samples_split=5)

rfclf_grid.fit(X=X_train, y= y_train)

y_pred_train = rfclf_grid.predict(X_train)
y_pred_test = rfclf_grid.predict(X_test)

print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))

print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
Train Accuracy: 0.9399944024629163
Test Accuracy: 0.9106633081444164

            *****************************************
               CLASSIFICATION REPORT FOR TRAIN DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.9432    0.6954    0.8006      2817
           1     0.9382    0.9931    0.9649     14389
           2     0.9784    0.8255    0.8955       659

    accuracy                         0.9400     17865
   macro avg     0.9533    0.8380    0.8870     17865
weighted avg     0.9404    0.9400    0.9364     17865


            *****************************************
               CLASSIFICATION REPORT FOR TEST DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.8628    0.5761    0.6909       939
           1     0.9161    0.9856    0.9496      4796
           2     0.9226    0.7045    0.7990       220

    accuracy                         0.9107      5955
   macro avg     0.9005    0.7554    0.8132      5955
weighted avg     0.9079    0.9107    0.9032      5955

In [236]:
scores1= get_metrics(y_train,y_pred_train,y_test,y_pred_test,"RandomForestClassifier",scores1)
scores1
Out[236]:
Model Train_Accuracy Train_Recall Train_Precision Train_F1_Score Test_Accuracy Test_Recall Test_Precision Test_F1_Score
0 DecisionTree_BestParameters 0.840806 0.849623 0.666591 0.726479 0.839798 0.838476 0.661961 0.720439
1 DecisionTree_BestParameters-2 0.682900 0.754808 0.644660 0.643605 0.689673 0.754863 0.661100 0.655250
2 RandomForestClassifier 0.939994 0.838011 0.953254 0.886967 0.910663 0.755434 0.900514 0.813160

This is clearly an overfitting model: train recall is noticeably higher than test recall.
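The overfitting call can be made concrete by comparing the train and test macro-recall values from the scores table (a small hypothetical helper; the 0.05 tolerance is an arbitrary choice, not part of the project code):

```python
def overfit_gap(train_score, test_score, tol=0.05):
    """Return the train-test gap and whether it exceeds the tolerance."""
    gap = train_score - test_score
    return gap, gap > tol

# macro-recall values from the RandomForestClassifier row of the scores table
gap, flagged = overfit_gap(0.838011, 0.755434)
```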

Let's build a random forest using variable importance

In [237]:
importances = rfclf_grid.feature_importances_
importances
indices = np.argsort(importances)[::-1]
print(indices)
select = indices[0:30]
print(select)
rfclf4 = RandomForestClassifier(criterion= 'entropy', 
                                max_depth= 3, 
                                min_samples_leaf= 1,
                                min_samples_split= 2,
                                ccp_alpha=0.001)

rfclf4 = rfclf4.fit(X_train.values[:,select], y_train)


y_pred_train = rfclf4.predict(X_train.values[:,select])
y_pred_test = rfclf4.predict(X_test.values[:,select])
print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))

print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
[ 0 11 14  9  8 12 10 13  4  7 21 17  6 29  3 15  2  1 16 31  5 22 30 37
 35 44 53 87 45 38 70 43 73 56 18 52 32 20 19 26 27 24 36 39 69 41 71 25
 34 90 72 86 88 59 60 92 62 74 54 91 93 58 55 79 89 96 57 61 94 33 75 95
 76 66 77 49 78 83 28 50 51 80 82 84 85 68 47 81 23 67 40 46 64 42 48 65
 63]
[ 0 11 14  9  8 12 10 13  4  7 21 17  6 29  3 15  2  1 16 31  5 22 30 37
 35 44 53 87 45 38]
Train Accuracy: 0.8784214945424014
Test Accuracy: 0.8779177162048698

            *****************************************
               CLASSIFICATION REPORT FOR TRAIN DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.9466    0.3269    0.4860      2817
           1     0.8743    0.9998    0.9329     14389
           2     0.8813    0.5857    0.7037       659

    accuracy                         0.8784     17865
   macro avg     0.9007    0.6375    0.7075     17865
weighted avg     0.8860    0.8784    0.8539     17865


            *****************************************
               CLASSIFICATION REPORT FOR TEST DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.9560    0.3237    0.4837       939
           1     0.8730    1.0000    0.9322      4796
           2     0.8951    0.5818    0.7052       220

    accuracy                         0.8779      5955
   macro avg     0.9080    0.6352    0.7070      5955
weighted avg     0.8869    0.8779    0.8531      5955

Train and test scores are nearly identical here, but recall for classes 0 and 2 is still poor, so we will not use this model either.

3)XGBOOST MODEL WITH RANDOMIZED CV¶

clf_XGB_grid = XGBClassifier()
# Use a grid over parameters of interest
param_grid = {
     'colsample_bytree': np.linspace(0.6, 0.8, 2),
     'n_estimators':[100, 200],
     'max_depth': [3, 4],
     'gamma': [0.2,0.3,0.4],
     'learning_rate': [0.001, 0.01, 0.1, 1, 10],
     'scale_pos_weight':[4]
}
CV_XGB = RandomizedSearchCV(estimator=clf_XGB_grid, param_distributions=param_grid, n_jobs=-1,cv=skf, scoring='recall')
%time CV_XGB.fit(X = X_train, y=y_train)
best_xgb_model = CV_XGB.best_estimator_
print (CV_XGB.best_score_, CV_XGB.best_params_)
CV_XGB = CV_XGB.best_estimator_ 
print(CV_XGB)

{'scale_pos_weight': 4, 'n_estimators': 200, 'max_depth': 4, 'learning_rate': 0.1, 'gamma': 0.4, 'colsample_bytree': 0.6}

In [238]:
CV_XGB =XGBClassifier(scale_pos_weight=4,n_estimators=200,max_depth=4,learning_rate=0.1,gamma=0.4,colsample_bytree=0.6)

CV_XGB.fit(X=X_train, y= y_train)

y_pred_train = CV_XGB.predict(X_train)
y_pred_test = CV_XGB.predict(X_test)

print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))

print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
[05:22:17] WARNING: ../src/learner.cc:767: 
Parameters: { "scale_pos_weight" } are not used.

Train Accuracy: 0.9280156731038343
Test Accuracy: 0.9163727959697733

            *****************************************
               CLASSIFICATION REPORT FOR TRAIN DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.8856    0.6486    0.7488      2817
           1     0.9313    0.9842    0.9570     14389
           2     0.9916    0.8953    0.9410       659

    accuracy                         0.9280     17865
   macro avg     0.9362    0.8427    0.8823     17865
weighted avg     0.9263    0.9280    0.9236     17865


            *****************************************
               CLASSIFICATION REPORT FOR TEST DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.8509    0.6198    0.7172       939
           1     0.9237    0.9796    0.9508      4796
           2     0.9568    0.8045    0.8741       220

    accuracy                         0.9164      5955
   macro avg     0.9104    0.8013    0.8474      5955
weighted avg     0.9134    0.9164    0.9111      5955
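The earlier warning that `scale_pos_weight` was not used is expected: that parameter applies only to binary objectives, and this is a three-class problem. A hedged alternative is to pass balanced per-sample weights to `fit` instead (a minimal sketch on toy labels using scikit-learn's `compute_sample_weight`, not the project data):

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# toy multiclass labels mimicking the imbalance (class 1 dominates)
y_toy = np.array([1] * 8 + [0] * 3 + [2] * 1)
# with "balanced", every class ends up carrying the same total weight;
# these could then be passed as fit(..., sample_weight=sample_weights)
sample_weights = compute_sample_weight(class_weight="balanced", y=y_toy)
```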

In [239]:
scores1= get_metrics(y_train,y_pred_train,y_test,y_pred_test,"XGB_BestParameters",scores1)
scores1
Out[239]:
Model Train_Accuracy Train_Recall Train_Precision Train_F1_Score Test_Accuracy Test_Recall Test_Precision Test_F1_Score
0 DecisionTree_BestParameters 0.840806 0.849623 0.666591 0.726479 0.839798 0.838476 0.661961 0.720439
1 DecisionTree_BestParameters-2 0.682900 0.754808 0.644660 0.643605 0.689673 0.754863 0.661100 0.655250
2 RandomForestClassifier 0.939994 0.838011 0.953254 0.886967 0.910663 0.755434 0.900514 0.813160
3 XGB_BestParameters 0.928016 0.842694 0.936161 0.882260 0.916373 0.801307 0.910449 0.847361

4)GBM, SVM, ANN - HYPERPARAMETER TUNING¶

# Gradient boosting with hyperparameter tuning
# define the hyperparameter space
param_dist = {'learning_rate': [0.1, 0.05, 0.01],
              'max_depth': [2,4,6],
              'max_features': ['auto', 'sqrt', 'log2'],}

# define the classifier
clf = GradientBoostingClassifier()

# define the random search with cross validation
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist, n_iter=n_iter_search, cv=skf,scoring='recall')

# fit the random search to the data
random_search.fit(X_train, y_train)

# print the best hyperparameters and score
print("Best hyperparameters: ", random_search.best_params_)
print("Best score: ", random_search.best_score_)

Best hyperparameters: {'max_features': 'sqrt', 'max_depth': 4, 'learning_rate': 0.01}

In [240]:
clf_GBM = GradientBoostingClassifier(max_features='sqrt',max_depth=4,learning_rate=0.01)
clf_GBM.fit(X=X_train, y=y_train)

y_pred_train = clf_GBM.predict(X_train)
y_pred_test = clf_GBM.predict(X_test)

print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))

print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
scores = get_metrics(y_train,y_pred_train,y_test,y_pred_test,"GBM",scores)
Train Accuracy: 0.8774139378673383
Test Accuracy: 0.8770780856423174

            *****************************************
               CLASSIFICATION REPORT FOR TRAIN DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.9361    0.3326    0.4908      2817
           1     0.8723    1.0000    0.9318     14389
           2     0.9458    0.5296    0.6790       659

    accuracy                         0.8774     17865
   macro avg     0.9181    0.6207    0.7005     17865
weighted avg     0.8851    0.8774    0.8529     17865


            *****************************************
               CLASSIFICATION REPORT FOR TEST DATA
            *****************************************
            
              precision    recall  f1-score   support

           0     0.9426    0.3323    0.4913       939
           1     0.8715    1.0000    0.9314      4796
           2     0.9504    0.5227    0.6745       220

    accuracy                         0.8771      5955
   macro avg     0.9215    0.6183    0.6991      5955
weighted avg     0.8856    0.8771    0.8525      5955

This model is not overfitting but generalizes well; however, recall for classes 0 and 2 is very low. The model is good at predicting label 1 but not labels 0 and 2.
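The per-class recall gap described above can be inspected directly with scikit-learn's `recall_score` using `average=None` (a toy illustration, not the project data):

```python
from sklearn.metrics import recall_score

y_true = [0, 0, 1, 1, 1, 2, 2]
y_pred = [1, 0, 1, 1, 1, 1, 2]
# average=None returns one recall value per label, in sorted label order
per_class = recall_score(y_true, y_pred, average=None)
```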

In [241]:
scores1= get_metrics(y_train,y_pred_train,y_test,y_pred_test,"GradientBoostingClassifier",scores1)
scores1
Out[241]:
Model Train_Accuracy Train_Recall Train_Precision Train_F1_Score Test_Accuracy Test_Recall Test_Precision Test_F1_Score
0 DecisionTree_BestParameters 0.840806 0.849623 0.666591 0.726479 0.839798 0.838476 0.661961 0.720439
1 DecisionTree_BestParameters-2 0.682900 0.754808 0.644660 0.643605 0.689673 0.754863 0.661100 0.655250
2 RandomForestClassifier 0.939994 0.838011 0.953254 0.886967 0.910663 0.755434 0.900514 0.813160
3 XGB_BestParameters 0.928016 0.842694 0.936161 0.882260 0.916373 0.801307 0.910449 0.847361
4 GradientBoostingClassifier 0.877414 0.620738 0.918063 0.700544 0.877078 0.618332 0.921512 0.699059

*Out of all the models with hyperparameter tuning, the best decision tree parameters are {'min_samples_leaf': 5, 'max_leaf_nodes': 20, 'max_depth': 6, 'criterion': 'gini', 'class_weight': 'balanced'}
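For reference, those best parameters correspond to the following estimator (a sketch only; it assumes the usual scikit-learn import and is not refit here):

```python
from sklearn.tree import DecisionTreeClassifier

best_dt = DecisionTreeClassifier(
    criterion="gini",
    max_depth=6,
    max_leaf_nodes=20,
    min_samples_leaf=5,
    class_weight="balanced",
)
```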


10)Decision tree pattern extraction¶

In [380]:
from sklearn.tree import export_text
rules = export_text(clf_dt_grid, feature_names=X_train.columns.tolist())
# Sort rules by importance scores and select the top 20 rules
rule_scores = clf_dt_grid.feature_importances_
rule_list = rules.split('\n')
rule_list.pop() # Remove the last empty element
rule_importance = [(rule, score) for rule, score in zip(rule_list, rule_scores)]
rule_importance_sorted = sorted(rule_importance, key=lambda x: x[1], reverse=True)
top_20_rules = [rule for rule, score in rule_importance_sorted[:20]]

# Print the top 20 rules
print("Top 20 rules:")
for rule in top_20_rules:
    print(rule)
Top 20 rules:
|--- sumoforderquantity <= -0.11
|   |   |   |   |--- recency <= 0.24
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |   |--- num_transactions <= 0.01
|   |   |   |   |   |   |--- class: 0
|   |   |   |--- duration_of_customer_relation <= -0.48
|   |   |   |   |   |--- response_rate >  3.29
|   |   |--- AutomaticRefill_yes <= 0.50
|   |   |   |   |--- recency >  0.24
|   |   |   |   |   |   |--- numofemails <= -0.85
|   |   |   |   |   |   |   |--- open_rate_ratio <= 0.55
|   |   |   |   |   |--- class: 0
|   |   |   |   |--- Registration_month_12 >  0.50
|   |   |   |   |   |--- days_since_first_transaction >  -0.35
|   |   |   |   |--- class: 1
|   |   |   |   |   |--- days_since_first_transaction <= -0.35
|   |   |   |   |   |   |   |--- open_rate_ratio >  0.55
|   |--- numofemails <= -0.29
|   |   |   |   |   |--- OnlineCommunication_yes <= 0.50
|   |   |   |   |   |--- OnlineCommunication_yes >  0.50

11)Test-data¶

In [243]:
customer=pd.read_excel("/content/Customerdata_Test.xlsx")
test=pd.read_excel("/content/Test.xlsx")
transaction=pd.read_excel("/content/transactionsdata_Test.xlsx")
email=pd.read_excel("/content/Emaildata_Test.xlsx")
In [244]:
#checking all the datasets shape to get a proper understanding
print("shape of customer",customer.shape)
print("shape of train",test.shape)
print("shape of transaction",transaction.shape)
print("shape of email",email.shape)
shape of customer (5955, 7)
shape of train (5955, 1)
shape of transaction (26768, 4)
shape of email (129074, 6)
In [245]:
#customer.head()
customer["CustomerID"].nunique()
#checking unique values for all the customer ids
Out[245]:
5955
In [246]:
#test.head()
test["CustomerID"].nunique()
#checking unique values for all the customer ids
Out[246]:
5955
In [247]:
#transaction['DateOfOrder'] = pd.to_datetime(transaction['DateOfOrder'])
transaction["CustomerID"].nunique()
#checking unique values for all the customer ids
Out[247]:
5955
In [248]:
# combine the 'DateOfOrder' and 'timestamp' columns into a single datetime column
transaction['last_transaction_date'] = pd.to_datetime(transaction['DateOfOrder'] +' '+transaction['timestamp'])
In [249]:
transaction['DateOfOrder']=pd.to_datetime(transaction['DateOfOrder'])
# Sort the dataset by customer ID and purchase date
df3 = transaction.sort_values(['CustomerID', 'DateOfOrder'])

# Calculate the time between purchases for each customer
df3['time_between_purchases'] = df3.groupby('CustomerID')['DateOfOrder'].diff().dt.days.fillna(0)
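The groupby-diff pattern above computes gaps within each customer's order history; on a toy frame (made-up IDs and dates) it behaves like this:

```python
import pandas as pd

toy = pd.DataFrame({
    "CustomerID": ["A", "A", "B"],
    "DateOfOrder": pd.to_datetime(["2020-01-01", "2020-01-11", "2020-03-01"]),
}).sort_values(["CustomerID", "DateOfOrder"])

# diff() is computed within each customer; the first order has no
# previous order, so its gap is NaN and gets filled with 0
toy["time_between_purchases"] = (
    toy.groupby("CustomerID")["DateOfOrder"].diff().dt.days.fillna(0)
)
```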
In [250]:
df3['order_month']=pd.to_datetime(df3['DateOfOrder']).dt.month
# group the data by customer ID and month, and count the number of orders in each group
order_counts = df3.groupby(['CustomerID', 'order_month'])['OrderQuantity'].sum()
# use the unstack method to pivot the table so that the rows represent customers and the columns represent months
order_counts = order_counts.unstack(level=-1, fill_value=0)
# fill any missing values with 0
order_counts = order_counts.fillna(0)
order_counts.columns=["Totalorders-jan","Totalorders-feb","Totalorders-march","Totalorders-april","Totalorders-may","Totalorders-june","Totalorders-july","Totalorders-august","Totalorders-september","Totalorders-october","Totalorders-november","Totalorders-december"]
order_counts=order_counts.reset_index(drop=False)
order_counts
Out[250]:
CustomerID Totalorders-jan Totalorders-feb Totalorders-march Totalorders-april Totalorders-may Totalorders-june Totalorders-july Totalorders-august Totalorders-september Totalorders-october Totalorders-november Totalorders-december
0 C12121 0 0 0 0 0 0 0 0 0 0 0 52
1 C12132 0 0 0 0 0 0 0 0 0 40 0 0
2 C12134 0 0 0 0 0 0 0 0 40 0 0 0
3 C12135 41 0 0 0 0 0 0 0 0 0 0 0
4 C12136 0 0 0 0 0 0 0 0 0 79 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
5950 C42153 0 0 0 0 0 0 0 0 0 82 297 0
5951 C42155 0 0 0 0 0 0 0 0 41 0 0 0
5952 C42157 0 0 0 0 0 0 0 0 41 0 0 0
5953 C42167 0 0 90 74 0 0 0 0 0 187 151 0
5954 C42190 0 0 0 0 0 0 0 0 40 0 0 0

5955 rows × 13 columns
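The groupby-sum-unstack pattern used above can be illustrated on a toy frame (made-up customer IDs, not the project data):

```python
import pandas as pd

toy = pd.DataFrame({
    "CustomerID": ["A", "A", "B"],
    "order_month": [1, 2, 1],
    "OrderQuantity": [10, 5, 7],
})
# sum per (customer, month), then pivot months into columns; absent
# (customer, month) pairs are filled with 0
pivot = (
    toy.groupby(["CustomerID", "order_month"])["OrderQuantity"]
    .sum()
    .unstack(level=-1, fill_value=0)
)
```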

In [251]:
# Group the data by customer ID and calculate the number of transactions, total amount spent, and last transaction date
grouped_data = df3.groupby('CustomerID').agg({
    'CustomerID': 'count', # Number of transactions
    'OrderQuantity': ['sum','min','max'],# Total orders
    'last_transaction_date': ['min', 'max'],# first transaction date,Last transaction date
    'time_between_purchases':'mean' # average time between purchases
    
})
# Rename the columns to be more descriptive
grouped_data.columns = ['num_transactions','totalorders','minimumorderquantity','maximumorderquantity','first_transaction_date','last_transaction_date','average_time_between_purchase']
In [252]:
grouped_data['DateOfOrder'] = grouped_data['last_transaction_date'].dt.date.astype(str)
grouped_data['timestamp'] = grouped_data['last_transaction_date'].dt.time.apply(lambda x: x.strftime('%H:%M:%S'))
grouped_data["averageorders"]=grouped_data["totalorders"]//grouped_data["num_transactions"]
grouped_data = grouped_data.reset_index(drop=False)
In [253]:
# merge the grouped DataFrame back with the main DataFrame
transaction_data = pd.merge(df3, grouped_data, on=['CustomerID','last_transaction_date'],how="inner")
In [254]:
transaction_data = pd.merge(transaction_data, order_counts, on=['CustomerID'],how="inner")
In [255]:
transaction['last_transaction_date']
transaction = transaction.rename(columns={'last_transaction_date': 'transaction_date'})
In [256]:
transaction_data.shape
#rechecking the shape of the merged data
#transaction_data["CustomerID"].nunique()
#rechecking for unique customerids
Out[256]:
(5955, 28)
In [257]:
transaction_data.columns
Out[257]:
Index(['CustomerID', 'DateOfOrder_x', 'timestamp_x', 'OrderQuantity',
       'last_transaction_date', 'time_between_purchases', 'order_month',
       'num_transactions', 'totalorders', 'minimumorderquantity',
       'maximumorderquantity', 'first_transaction_date',
       'average_time_between_purchase', 'DateOfOrder_y', 'timestamp_y',
       'averageorders', 'Totalorders-jan', 'Totalorders-feb',
       'Totalorders-march', 'Totalorders-april', 'Totalorders-may',
       'Totalorders-june', 'Totalorders-july', 'Totalorders-august',
       'Totalorders-september', 'Totalorders-october', 'Totalorders-november',
       'Totalorders-december'],
      dtype='object')
In [258]:
transaction_data=transaction_data.drop(['DateOfOrder_x', 'timestamp_x','DateOfOrder_y', 'timestamp_y','time_between_purchases','order_month'],axis=1)
In [259]:
transaction_data.head()
Out[259]:
CustomerID OrderQuantity last_transaction_date num_transactions totalorders minimumorderquantity maximumorderquantity first_transaction_date average_time_between_purchase averageorders Totalorders-jan Totalorders-feb Totalorders-march Totalorders-april Totalorders-may Totalorders-june Totalorders-july Totalorders-august Totalorders-september Totalorders-october Totalorders-november Totalorders-december
0 C12121 52 2017-12-14 11:38:00 1 52 52 52 2017-12-14 11:38:00 0.0 52 0 0 0 0 0 0 0 0 0 0 0 52
1 C12132 40 2013-10-20 17:52:44 1 40 40 40 2013-10-20 17:52:44 0.0 40 0 0 0 0 0 0 0 0 0 40 0 0
2 C12134 40 2013-09-18 10:24:48 1 40 40 40 2013-09-18 10:24:48 0.0 40 0 0 0 0 0 0 0 0 40 0 0 0
3 C12135 41 2018-01-10 19:33:58 1 41 41 41 2018-01-10 19:33:58 0.0 41 41 0 0 0 0 0 0 0 0 0 0 0
4 C12136 79 2013-10-19 20:51:18 1 79 79 79 2013-10-19 20:51:18 0.0 79 0 0 0 0 0 0 0 0 0 79 0 0
In [260]:
email.head()
Out[260]:
CustomerID DateOfemail EmailType timestamp MailOpened MailClicked
0 C17838 2010-10-22 WelcomeEmail 16:59:02 yes NotApplicable
1 C12403 2008-11-17 WelcomeEmail 12:46:56 yes NotApplicable
2 C19041 2011-01-24 WelcomeEmail 14:35:22 yes NotApplicable
3 C24069 2010-09-18 WelcomeEmail 12:25:26 yes NotApplicable
4 C26718 2010-10-02 WelcomeEmail 11:55:58 yes NotApplicable
In [261]:
email.columns
Out[261]:
Index(['CustomerID', 'DateOfemail', 'EmailType', 'timestamp', 'MailOpened',
       'MailClicked'],
      dtype='object')
In [262]:
# combine the 'Date of email' and 'Time' columns into a single datetime column
email['emailDatetime'] = pd.to_datetime(email['DateOfemail'] +' '+email['timestamp'])
In [263]:
email["EmailType"].value_counts()
Out[263]:
CustomizedEmail    123119
WelcomeEmail         5955
Name: EmailType, dtype: int64
In [264]:
email.columns
Out[264]:
Index(['CustomerID', 'DateOfemail', 'EmailType', 'timestamp', 'MailOpened',
       'MailClicked', 'emailDatetime'],
      dtype='object')
In [265]:
# group the data by customer ID and mail-opened status, and count how many mails each customer opened or did not open
mail_open_counts = email.groupby(['CustomerID', 'MailOpened'])['emailDatetime'].count()
# use the unstack method to pivot the table so that the rows represent customers and the columns represent mail-opened status
mail_open_counts = mail_open_counts.unstack(level=-1, fill_value=0)
# fill any missing values with 0
mail_open_counts = mail_open_counts.fillna(0)
mail_open_counts.columns=["Totalmails-notoponed","Totalmails-oponed"]
mail_open_counts=mail_open_counts.reset_index(drop=False)
In [266]:
# group the data by customer ID and mail-clicked status, and count how many mails each customer clicked or did not click
MailClicked_counts = email.groupby(['CustomerID', 'MailClicked'])['emailDatetime'].count()
# use the unstack method to pivot the table so that the rows represent customers and the columns represent mail-clicked status
MailClicked_counts = MailClicked_counts.unstack(level=-1, fill_value=0)
# fill any missing values with 0
MailClicked_counts = MailClicked_counts.fillna(0)
MailClicked_counts.columns=["Notapplicableclickedmails","timesclicked-yes","timesclicked-no"]
MailClicked_counts=MailClicked_counts.reset_index(drop=False)
In [267]:
MailClicked_counts
Out[267]:
CustomerID Notapplicableclickedmails timesclicked-yes timesclicked-no
0 C12121 1 21 3
1 C12132 1 0 0
2 C12134 1 0 0
3 C12135 1 24 5
4 C12136 1 40 1
... ... ... ... ...
5950 C42153 1 9 1
5951 C42155 1 0 0
5952 C42157 1 0 0
5953 C42167 1 48 1
5954 C42190 1 0 0

5955 rows × 4 columns

In [268]:
#filtering customers who received a welcome mail but no customized mail
welcome_mail = email[(email['EmailType'] == 'WelcomeEmail') & ~(email['CustomerID'].isin(email[email['EmailType'] == 'CustomizedEmail']['CustomerID'].unique()))]
welcome_mail["CustomerID"].nunique()
Out[268]:
2457
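The filter above combines an `EmailType` mask with a negated `isin` membership test; a toy version (made-up IDs) shows the mechanics:

```python
import pandas as pd

email_toy = pd.DataFrame({
    "CustomerID": ["A", "A", "B"],
    "EmailType": ["WelcomeEmail", "CustomizedEmail", "WelcomeEmail"],
})
customized_ids = email_toy.loc[
    email_toy["EmailType"] == "CustomizedEmail", "CustomerID"
].unique()
# keep welcome mails only for customers who never got a customized mail
welcome_only = email_toy[
    (email_toy["EmailType"] == "WelcomeEmail")
    & ~email_toy["CustomerID"].isin(customized_ids)
]
```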
In [269]:
welcome_mail['DateOfemail']=pd.to_datetime(welcome_mail['DateOfemail'])
# Sort the dataset by customer ID and purchase date
welcome_mail = welcome_mail.sort_values(['CustomerID', 'DateOfemail'])

# Calculate the time between purchases for each customer
welcome_mail['time_between_emails'] = welcome_mail.groupby('CustomerID')['DateOfemail'].diff().dt.days.fillna(0)
In [270]:
# Group the data by customer ID and calculate the number of emails sent to customer,first email sent, and last email date
grouped_data3 = welcome_mail.groupby('CustomerID').agg({
    'CustomerID': 'count', # Number of emails
    'emailDatetime': ['max','min'], # Last email date,first email date
    'time_between_emails':'mean' 
})
# Rename the columns to be more descriptive
grouped_data3.columns = ['numofemails','lastemailDatetime','firstemailDatetime','average_time_between_mails']
grouped_data3 = grouped_data3.reset_index(drop=False)
In [271]:
grouped_data3['DateOfemail'] = grouped_data3['lastemailDatetime'].dt.date.astype(str)
grouped_data3['timestamp'] = grouped_data3['lastemailDatetime'].dt.time.apply(lambda x: x.strftime('%H:%M:%S'))
In [272]:
welcome_mail['DateOfemail']=welcome_mail['DateOfemail'].dt.date.astype(str)
In [273]:
# merge the grouped DataFrame back with the main DataFrame
merged_data3 = pd.merge(welcome_mail, grouped_data3, on=['CustomerID','DateOfemail','timestamp'],how="inner")
In [274]:
#merged_data.shape
#merged_data["CustomerID"].nunique()
#merged_data.columns
merged_data3=merged_data3.drop(['DateOfemail','timestamp'],axis=1)
In [275]:
#filtering customers who received a customized mail
customizedmail=email[email['EmailType'] == 'CustomizedEmail']
In [276]:
customizedmail['DateOfemail']=pd.to_datetime(customizedmail['DateOfemail'])
# Sort the dataset by customer ID and purchase date
customizedmail = customizedmail.sort_values(['CustomerID', 'DateOfemail'])

# Calculate the time between purchases for each customer
customizedmail['time_between_emails'] = customizedmail.groupby('CustomerID')['DateOfemail'].diff().dt.days.fillna(0)
In [277]:
# Group the data by customer ID and calculate the number of emails sent to customer,first email sent, and last email date
grouped_data2 = customizedmail.groupby('CustomerID').agg({
    'CustomerID': 'count', # Number of emails
    'emailDatetime': ['max','min'], # Last email date,first email date
    'time_between_emails':'mean' 
})
In [278]:
# Rename the columns to be more descriptive
grouped_data2.columns = ['numofemails','lastemailDatetime','firstemailDatetime','average_time_between_mails']
grouped_data2 = grouped_data2.reset_index(drop=False)
grouped_data2['DateOfemail'] = grouped_data2['lastemailDatetime'].dt.date.astype(str)
grouped_data2['timestamp'] = grouped_data2['lastemailDatetime'].dt.time.apply(lambda x: x.strftime('%H:%M:%S'))
In [279]:
customizedmail['DateOfemail']=customizedmail['DateOfemail'].dt.date.astype(str)
In [280]:
# merge the grouped DataFrame back with the main DataFrame
merged_data = pd.merge(customizedmail, grouped_data2, on=['CustomerID','DateOfemail','timestamp'],how="inner")
In [281]:
#merged_data.shape
#merged_data["CustomerID"].nunique()
#merged_data.columns
merged_data=merged_data.drop(['DateOfemail','timestamp'],axis=1)
In [282]:
mail_data= pd.concat([merged_data3,merged_data])
mail_data.shape
Out[282]:
(5955, 10)
In [283]:
mail_data=mail_data.drop(["time_between_emails"],axis=1)
In [284]:
test=customer.merge(test, on="CustomerID").merge(mail_data, on="CustomerID").merge(transaction_data, on="CustomerID").merge(mail_open_counts, on="CustomerID").merge(MailClicked_counts, on="CustomerID")
test.shape
Out[284]:
(5955, 41)
In [285]:
#checking the unique values and dtype of each column so that we can spot any incorrect dtypes
for i in test.columns:
  print("no of unique values and dtype:","|",i,"|",test[i].nunique(),"|",test[i].dtypes)
no of unique values and dtype: | CustomerID | 5955 | object
no of unique values and dtype: | City | 4 | object
no of unique values and dtype: | DateOfRegistration | 1631 | datetime64[ns]
no of unique values and dtype: | OnlineCommunication | 2 | int64
no of unique values and dtype: | AutomaticRefill | 2 | int64
no of unique values and dtype: | DoorstepDelivery | 2 | int64
no of unique values and dtype: | PreferredDeliveryDay | 7 | object
no of unique values and dtype: | EmailType | 2 | object
no of unique values and dtype: | MailOpened | 2 | object
no of unique values and dtype: | MailClicked | 3 | object
no of unique values and dtype: | emailDatetime | 5954 | datetime64[ns]
no of unique values and dtype: | numofemails | 75 | int64
no of unique values and dtype: | lastemailDatetime | 5954 | datetime64[ns]
no of unique values and dtype: | firstemailDatetime | 5955 | datetime64[ns]
no of unique values and dtype: | average_time_between_mails | 2097 | float64
no of unique values and dtype: | OrderQuantity | 234 | int64
no of unique values and dtype: | last_transaction_date | 5953 | datetime64[ns]
no of unique values and dtype: | num_transactions | 106 | int64
no of unique values and dtype: | totalorders | 1030 | int64
no of unique values and dtype: | minimumorderquantity | 221 | int64
no of unique values and dtype: | maximumorderquantity | 248 | int64
no of unique values and dtype: | first_transaction_date | 5954 | datetime64[ns]
no of unique values and dtype: | average_time_between_purchase | 1038 | float64
no of unique values and dtype: | averageorders | 234 | int64
no of unique values and dtype: | Totalorders-jan | 381 | int64
no of unique values and dtype: | Totalorders-feb | 298 | int64
no of unique values and dtype: | Totalorders-march | 316 | int64
no of unique values and dtype: | Totalorders-april | 318 | int64
no of unique values and dtype: | Totalorders-may | 342 | int64
no of unique values and dtype: | Totalorders-june | 335 | int64
no of unique values and dtype: | Totalorders-july | 352 | int64
no of unique values and dtype: | Totalorders-august | 364 | int64
no of unique values and dtype: | Totalorders-september | 365 | int64
no of unique values and dtype: | Totalorders-october | 405 | int64
no of unique values and dtype: | Totalorders-november | 407 | int64
no of unique values and dtype: | Totalorders-december | 401 | int64
no of unique values and dtype: | Totalmails-notoponed | 64 | int64
no of unique values and dtype: | Totalmails-oponed | 52 | int64
no of unique values and dtype: | Notapplicableclickedmails | 1 | int64
no of unique values and dtype: | timesclicked-yes | 73 | int64
no of unique values and dtype: | timesclicked-no | 23 | int64
In [286]:
#converting dtypes of the following columns
test["OnlineCommunication"]=test["OnlineCommunication"].astype("object")
test["DoorstepDelivery"]=test["DoorstepDelivery"].astype("object")
test["AutomaticRefill"]=test["AutomaticRefill"].astype("object")
In [287]:
test["days_since_first_transaction"]=(test['last_transaction_date'] - test['first_transaction_date']).dt.days
test["email_duration"]=(test['lastemailDatetime']- test['firstemailDatetime']).dt.days
test["duration_of_customer_relation"]=(test['last_transaction_date'] - test['DateOfRegistration']).dt.days
test["recency"]=(test["last_transaction_date"].max()-test["last_transaction_date"]).dt.days
In [288]:
test["email_duration"]
Out[288]:
0          0
1       1859
2       1058
3         50
4       1023
        ... 
5950       0
5951       0
5952       0
5953       0
5954       0
Name: email_duration, Length: 5955, dtype: int64
In [290]:
bins = pd.cut(test["days_since_first_transaction"], [-1, 1, 10, 100, test["days_since_first_transaction"].max()], labels=["Only transaction", "Recent Transactions", "Moderate History", "Long-term History"])
test["days_since_first_transaction_category"] = bins
test["days_since_first_transaction_category"].value_counts()
Out[290]:
Only transaction       3748
Moderate History       1021
Long-term History       987
Recent Transactions     199
Name: days_since_first_transaction_category, dtype: int64
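`pd.cut` uses right-inclusive intervals by default, so with edges like the ones above a value of exactly 1 still lands in the first bin; a toy check (made-up values and labels):

```python
import pandas as pd

vals = pd.Series([0, 1, 2, 10, 11, 150])
# intervals are (-1, 1], (1, 10], (10, 100], (100, 200]
cats = pd.cut(vals, [-1, 1, 10, 100, 200],
              labels=["only", "recent", "moderate", "long"])
```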
In [291]:
bins = pd.cut(test["email_duration"], [-1, 1, 10, 100, test["email_duration"].max()], labels=["Only email", "Medium history", "long mail history", "very Long mail History"])
test["email_duration_category"] = bins
test["email_duration_category"].value_counts()
Out[291]:
Only email                2790
long mail history         1496
very Long mail History    1176
Medium history             493
Name: email_duration_category, dtype: int64
In [292]:
test["email_duration_category"]=test["email_duration_category"].astype("object")
test["days_since_first_transaction_category"]=test["days_since_first_transaction_category"].astype("object")
In [293]:
# these columns have already been binned, so they could be dropped (the drop is kept commented out here)
#test=test.drop(["days_since_first_transaction","email_duration"],axis=1)
In [294]:
test.columns
Out[294]:
Index(['CustomerID', 'City', 'DateOfRegistration', 'OnlineCommunication',
       'AutomaticRefill', 'DoorstepDelivery', 'PreferredDeliveryDay',
       'EmailType', 'MailOpened', 'MailClicked', 'emailDatetime',
       'numofemails', 'lastemailDatetime', 'firstemailDatetime',
       'average_time_between_mails', 'OrderQuantity', 'last_transaction_date',
       'num_transactions', 'totalorders', 'minimumorderquantity',
       'maximumorderquantity', 'first_transaction_date',
       'average_time_between_purchase', 'averageorders', 'Totalorders-jan',
       'Totalorders-feb', 'Totalorders-march', 'Totalorders-april',
       'Totalorders-may', 'Totalorders-june', 'Totalorders-july',
       'Totalorders-august', 'Totalorders-september', 'Totalorders-october',
       'Totalorders-november', 'Totalorders-december', 'Totalmails-notoponed',
       'Totalmails-oponed', 'Notapplicableclickedmails', 'timesclicked-yes',
       'timesclicked-no', 'days_since_first_transaction', 'email_duration',
       'duration_of_customer_relation', 'recency',
       'days_since_first_transaction_category', 'email_duration_category'],
      dtype='object')
In [295]:
# note: each "quarter" here actually spans four months (Jan-Apr, May-Aug, Sep-Dec)
test["Totalorderquantity_in_first_quarter"]=test["Totalorders-jan"]+test["Totalorders-feb"]+test["Totalorders-march"]+test["Totalorders-april"]
test["Totalorderquantity_in_second_quarter"]=test["Totalorders-may"]+test["Totalorders-june"]+test["Totalorders-july"]+test["Totalorders-august"]
test["Totalorderquantity_in_Third_quarter"]=test["Totalorders-december"]+test["Totalorders-september"]+test["Totalorders-october"]+test["Totalorders-november"]
test=test.drop(['Totalorders-jan', 'Totalorders-feb',
       'Totalorders-march', 'Totalorders-april', 'Totalorders-may',
       'Totalorders-june', 'Totalorders-july', 'Totalorders-august',
       'Totalorders-september', 'Totalorders-october', 'Totalorders-november',
       'Totalorders-december'],axis=1)
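Where several columns are summed row-wise, selecting the column list and calling `.sum(axis=1)` is an equivalent, less error-prone alternative to long chains of `+`. A toy sketch (hypothetical monthly totals, shortened to the first four columns):

```python
import pandas as pd

# Toy frame with hypothetical monthly order totals
df = pd.DataFrame({
    "Totalorders-jan": [1, 2], "Totalorders-feb": [0, 3],
    "Totalorders-march": [4, 0], "Totalorders-april": [2, 1],
})

# Row-wise sum over an explicit column list
first_period = ["Totalorders-jan", "Totalorders-feb",
                "Totalorders-march", "Totalorders-april"]
df["Totalorderquantity_in_first_quarter"] = df[first_period].sum(axis=1)
print(df["Totalorderquantity_in_first_quarter"].tolist())
```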
In [296]:
test["OnlineCommunication"].value_counts()
# map OnlineCommunication to new categories
new_categories = {
    0: 'no',
    1: 'yes',
}
test['OnlineCommunication'] = test['OnlineCommunication'].map(new_categories)
test['OnlineCommunication']=test['OnlineCommunication'].astype("category")

# print the resulting dataframe
print(test['OnlineCommunication'].value_counts())
yes    3926
no     2029
Name: OnlineCommunication, dtype: int64
In [297]:
#test["AutomaticRefill"].value_counts()
# map AutomaticRefill to new categories
new_categories = {
    0: 'no',
    1: 'yes',
}
test['AutomaticRefill'] = test['AutomaticRefill'].map(new_categories)

# print the resulting dataframe
print(test['AutomaticRefill'].value_counts())
test['AutomaticRefill']=test['AutomaticRefill'].astype("category")
no     5389
yes     566
Name: AutomaticRefill, dtype: int64
In [298]:
#test["DoorstepDelivery"].value_counts()
# map DoorstepDelivery to new categories
new_categories = {
    0: 'no',
    1: 'yes',
}
test['DoorstepDelivery'] = test['DoorstepDelivery'].map(new_categories)
test['DoorstepDelivery']=test['DoorstepDelivery'].astype("category")

# print the resulting dataframe
print(test['DoorstepDelivery'].value_counts())
no     5723
yes     232
Name: DoorstepDelivery, dtype: int64
In [299]:
test["PreferredDeliveryDay"].value_counts()
# map PreferredDeliveryDay to new categories
new_categories = {
    'Monday': 'Monday',
    'Tuesday':'Tuesday',
    'Wednesday':'Wednesday',
    'Thursday':'Thursday',
    'Friday':'Friday',
    'Saturday':'weekend',
    'Sunday':'weekend'

}
test['PreferredDeliveryDay'] = test['PreferredDeliveryDay'].map(new_categories)
test['PreferredDeliveryDay']=test['PreferredDeliveryDay'].astype("category")
# print the resulting dataframe
print(test['PreferredDeliveryDay'].value_counts())
Monday       1350
Tuesday      1311
Friday       1052
Thursday     1008
Wednesday     968
weekend       266
Name: PreferredDeliveryDay, dtype: int64
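One caveat with `Series.map`: any value missing from the mapping dict silently becomes NaN (here all seven weekdays are covered, so the mapping is safe). A minimal illustration with a hypothetical stray value:

```python
import pandas as pd

days = pd.Series(["Saturday", "Sunday", "Monday", "Holiday"])
mapping = {"Monday": "Monday", "Saturday": "weekend", "Sunday": "weekend"}

# Series.map returns NaN for any value missing from the dict
collapsed = days.map(mapping)
print(collapsed.tolist())
```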
In [300]:
test["Notapplicableclickedmails"].value_counts()#only one distinct value is present in the whole column, so it carries no information and can be dropped
test=test.drop(["Notapplicableclickedmails"],axis=1)
In [301]:
categoricalcolumns=train.select_dtypes(include='object')
In [302]:
test.dtypes
Out[302]:
CustomerID                                       object
City                                             object
DateOfRegistration                       datetime64[ns]
OnlineCommunication                            category
AutomaticRefill                                category
DoorstepDelivery                               category
PreferredDeliveryDay                           category
EmailType                                        object
MailOpened                                       object
MailClicked                                      object
emailDatetime                            datetime64[ns]
numofemails                                       int64
lastemailDatetime                        datetime64[ns]
firstemailDatetime                       datetime64[ns]
average_time_between_mails                      float64
OrderQuantity                                     int64
last_transaction_date                    datetime64[ns]
num_transactions                                  int64
totalorders                                       int64
minimumorderquantity                              int64
maximumorderquantity                              int64
first_transaction_date                   datetime64[ns]
average_time_between_purchase                   float64
averageorders                                     int64
Totalmails-notoponed                              int64
Totalmails-oponed                                 int64
timesclicked-yes                                  int64
timesclicked-no                                   int64
days_since_first_transaction                      int64
email_duration                                    int64
duration_of_customer_relation                     int64
recency                                           int64
days_since_first_transaction_category            object
email_duration_category                          object
Totalorderquantity_in_first_quarter               int64
Totalorderquantity_in_second_quarter              int64
Totalorderquantity_in_Third_quarter               int64
dtype: object
In [303]:
for i in categoricalcolumns:
    test[i] = test[i].astype("category")
In [304]:
test['averageorders'].describe()
Out[304]:
count    5955.000000
mean       60.636104
std        35.643257
min         1.000000
25%        41.000000
50%        51.000000
75%        73.000000
max       486.000000
Name: averageorders, dtype: float64
In [305]:
# Identify most profitable customers
profitable_customers = test[test['averageorders'] > 100].reset_index()

# Optimize inventory management
large_order_customers = test[test['averageorders'] > 50]
small_order_customers = test[test['averageorders'] <= 50]
# Improve customer experience based on order quantity
test['customer_segment'] = test.apply(lambda row: 'large_order_customer' if row['OrderQuantity'] > 50 else 'small_order_customer', axis=1)
test['customer_segment']=test['customer_segment'].astype("category")
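The row-wise `apply` above can be expressed as a single vectorised `np.where`, which behaves identically (note that 50 itself is not `> 50`, so it falls in the small segment). A sketch on toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical order quantities straddling the threshold
df = pd.DataFrame({"OrderQuantity": [10, 55, 50]})

# Vectorised equivalent of the row-wise apply: one comparison, no Python loop
df["customer_segment"] = np.where(df["OrderQuantity"] > 50,
                                  "large_order_customer",
                                  "small_order_customer")
print(df["customer_segment"].tolist())
```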
In [306]:
test["Time_to_First_Email"]=(test["first_transaction_date"]-test["firstemailDatetime"]).dt.days
test["Time_to_First_Email"]=test["Time_to_First_Email"].astype("int")
test["Time_to_First_Email"] = np.where(test["Time_to_First_Email"] < 0, 0, test["Time_to_First_Email"])
test["Time_since_Last_Email"]=(test["last_transaction_date"]-test["lastemailDatetime"]).dt.days
test["Time_since_Last_Email"]=test["Time_since_Last_Email"].astype("int")
In [307]:
# Loop over the rows and label each customer's busiest period; ties and Q3 wins both fall through to 'Q3'
for index, row in test.iterrows():
    if row['Totalorderquantity_in_first_quarter'] > row['Totalorderquantity_in_second_quarter'] and row['Totalorderquantity_in_first_quarter'] > row['Totalorderquantity_in_Third_quarter']:
        test.at[index, 'most_ordered_quarter'] = 'Q1'
    elif row['Totalorderquantity_in_second_quarter'] > row['Totalorderquantity_in_first_quarter'] and row['Totalorderquantity_in_second_quarter'] > row['Totalorderquantity_in_Third_quarter']:
        test.at[index, 'most_ordered_quarter'] = 'Q2'
    else:
        test.at[index, 'most_ordered_quarter'] = 'Q3'
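The `iterrows` loop above can also be written with `np.select`, keeping the same tie-handling (anything that is not a strict Q1 or Q2 winner defaults to Q3). A sketch with hypothetical shortened column names `q1`/`q2`/`q3`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "q1": [10, 2, 5],
    "q2": [3, 9, 5],
    "q3": [1, 1, 5],
})

# Same tie-handling as the loop: anything not a strict Q1/Q2 winner is Q3
conditions = [
    (df["q1"] > df["q2"]) & (df["q1"] > df["q3"]),
    (df["q2"] > df["q1"]) & (df["q2"] > df["q3"]),
]
df["most_ordered_quarter"] = np.select(conditions, ["Q1", "Q2"], default="Q3")
print(df["most_ordered_quarter"].tolist())
```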
In [308]:
test['most_ordered_quarter']=test['most_ordered_quarter'].astype("category")
test['City']=test['City'].astype("category")
test['EmailType']=test['EmailType'].astype("category")
test['MailOpened']=test['MailOpened'].astype("category")
test['MailClicked']=test['MailClicked'].astype("category")
In [309]:
test
Out[309]:
CustomerID City DateOfRegistration OnlineCommunication AutomaticRefill DoorstepDelivery PreferredDeliveryDay EmailType MailOpened MailClicked emailDatetime numofemails lastemailDatetime firstemailDatetime average_time_between_mails OrderQuantity last_transaction_date num_transactions totalorders minimumorderquantity maximumorderquantity first_transaction_date average_time_between_purchase averageorders Totalmails-notoponed Totalmails-oponed timesclicked-yes timesclicked-no days_since_first_transaction email_duration duration_of_customer_relation recency days_since_first_transaction_category email_duration_category Totalorderquantity_in_first_quarter Totalorderquantity_in_second_quarter Totalorderquantity_in_Third_quarter customer_segment Time_to_First_Email Time_since_Last_Email most_ordered_quarter
0 C17838 CITY4 2010-10-22 no no no Thursday WelcomeEmail yes NotApplicable 2010-10-22 16:59:02 1 2010-10-22 16:59:02 2010-10-22 16:59:02 0.000000 55 2011-03-28 17:41:08 1 55 55 55 2011-03-28 17:41:08 0.000000 55 0 1 0 0 0 0 157 2490 Only transaction Only email 55 0 0 large_order_customer 157 157 Q1
1 C12403 CITY2 2008-11-17 yes yes no Wednesday CustomizedEmail yes no 2013-12-29 13:32:42 46 2013-12-29 13:32:42 2008-11-25 15:40:12 40.434783 182 2014-01-14 10:57:06 168 29537 150 206 2010-10-12 15:58:02 7.083333 175 9 38 39 7 1189 1859 1884 1468 Long-term History very Long mail History 7507 11045 10985 large_order_customer 686 15 Q2
2 C19041 CITY4 2011-01-24 no yes no Friday CustomizedEmail no no 2013-12-31 13:02:12 60 2013-12-31 13:02:12 2011-02-07 11:00:44 17.633333 107 2014-01-16 17:46:14 122 14321 93 142 2011-05-16 14:36:36 8.000000 117 34 27 56 4 976 1058 1088 1465 Long-term History very Long mail History 3812 4783 5726 large_order_customer 98 16 Q3
3 C24069 CITY2 2010-09-18 no no no Wednesday CustomizedEmail no no 2010-11-08 17:25:24 45 2010-11-08 17:25:24 2010-09-19 10:07:08 1.111111 47 2010-11-09 18:31:18 1 47 47 47 2010-11-09 18:31:18 0.000000 47 45 1 45 0 0 50 52 2629 Only transaction long mail history 0 0 47 small_order_customer 51 1 Q3
4 C26718 CITY2 2010-10-02 yes no no Monday CustomizedEmail yes no 2014-01-12 11:33:42 13 2014-01-12 11:33:42 2011-03-26 10:20:28 78.692308 159 2014-01-12 13:04:04 129 20238 125 185 2011-01-25 17:08:08 8.395349 156 5 9 11 2 1082 1023 1198 1470 Long-term History very Long mail History 6096 6616 7526 large_order_customer 0 0 Q3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5950 C32947 CITY2 2018-01-17 yes no no Friday WelcomeEmail yes NotApplicable 2018-01-17 13:05:56 1 2018-01-17 13:05:56 2018-01-17 13:05:56 0.000000 41 2018-01-17 21:35:26 1 41 41 41 2018-01-17 21:35:26 0.000000 41 0 1 0 0 0 0 0 3 Only transaction Only email 41 0 0 small_order_customer 0 0 Q1
5951 C28951 CITY3 2018-01-17 yes no no Friday WelcomeEmail yes NotApplicable 2018-01-17 15:45:18 1 2018-01-17 15:45:18 2018-01-17 15:45:18 0.000000 40 2018-01-17 17:05:50 1 40 40 40 2018-01-17 17:05:50 0.000000 40 0 1 0 0 0 0 0 3 Only transaction Only email 40 0 0 small_order_customer 0 0 Q1
5952 C38741 CITY3 2018-01-17 yes no no Monday WelcomeEmail yes NotApplicable 2018-01-17 11:51:28 1 2018-01-17 11:51:28 2018-01-17 11:51:28 0.000000 36 2018-01-17 18:56:16 1 36 36 36 2018-01-17 18:56:16 0.000000 36 0 1 0 0 0 0 0 3 Only transaction Only email 36 0 0 small_order_customer 0 0 Q1
5953 C20407 CITY2 2018-01-17 yes no no Friday WelcomeEmail yes NotApplicable 2018-01-17 10:25:30 1 2018-01-17 10:25:30 2018-01-17 10:25:30 0.000000 75 2018-01-17 10:24:52 1 75 75 75 2018-01-17 10:24:52 0.000000 75 0 1 0 0 0 0 0 4 Only transaction Only email 75 0 0 large_order_customer 0 -1 Q1
5954 C27220 CITY3 2018-01-17 yes no no Friday WelcomeEmail yes NotApplicable 2018-01-17 14:36:32 1 2018-01-17 14:36:32 2018-01-17 14:36:32 0.000000 40 2018-01-17 12:04:10 1 40 40 40 2018-01-17 12:04:10 0.000000 40 0 1 0 0 0 0 0 4 Only transaction Only email 40 0 0 small_order_customer 0 -1 Q1

5955 rows × 41 columns

In [310]:
test['Registration_month']=test['DateOfRegistration'].dt.month
test['RegistrationYear']=test['DateOfRegistration'].dt.year
test['Registration_weekday']=test['DateOfRegistration'].dt.weekday
test['lastemail_month']=test['lastemailDatetime'].dt.month
test['lastemail_weekday']=test['lastemailDatetime'].dt.weekday
test['transaction_month']=test['last_transaction_date'].dt.month
test['transaction_weekday']=test['last_transaction_date'].dt.weekday
test=test.drop(["last_transaction_date","lastemailDatetime","DateOfRegistration"],axis=1)
In [311]:
test=test.drop(['firstemailDatetime',"first_transaction_date","emailDatetime"],axis=1)
In [312]:
for col in ['Registration_month','RegistrationYear',
       'Registration_weekday', 'lastemail_month', 'lastemail_weekday',
       'transaction_month', 'transaction_weekday']:
   test[col] =test[col].astype('category')
In [313]:
#feature elimination for numerical columns
test=test.drop(['Totalorderquantity_in_first_quarter',
       'Totalorderquantity_in_second_quarter',
       'Totalorderquantity_in_Third_quarter','minimumorderquantity', 'maximumorderquantity','averageorders'],axis=1)
In [314]:
# log-transformed email-engagement ratios; zero counts produce inf/NaN values,
# which are cleaned up in the next cell
test['open_rate'] = np.log(test['Totalmails-oponed'] / test['numofemails'])
test['click_through_rate'] = np.log(test['totalorders'] / test['Totalmails-oponed'])
test['response_rate'] = np.log(test['totalorders'] / test['numofemails'])
test['open_rate_ratio'] = np.log(test['Totalmails-oponed'] / test['Totalmails-notoponed'])
test['click_through_rate_ratio'] = np.log(test['timesclicked-yes'] / test['timesclicked-no'])
test['response_rate_ratio'] = np.log(test['totalorders']) / (test['numofemails'] - test['totalorders'])
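The log-ratio features above produce non-finite values whenever a count is zero, which is why the following cell replaces them. A minimal illustration on toy counts: a zero numerator gives `-inf`, a zero denominator gives `inf`.

```python
import numpy as np
import pandas as pd

# Hypothetical opened/sent counts covering the zero cases
opened = pd.Series([0, 5, 10])
sent = pd.Series([4, 5, 0])

# log of a zero ratio gives -inf; division by zero gives inf
rate = np.log(opened / sent)
print(rate.tolist())
```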
In [315]:
# Replace inf and NaN values in selected columns
cols = ['open_rate_ratio', 'click_through_rate_ratio', 'response_rate_ratio','click_through_rate','open_rate','response_rate']
replace_value = 0  # value substituted for inf, -inf and NaN in the columns above
test[cols] = test[cols].replace([np.inf, -np.inf], 0)
test[cols] = test[cols].fillna(replace_value)
In [316]:
test1=test.drop(['Totalmails-notoponed','Totalmails-oponed', 'timesclicked-yes', 'timesclicked-no'],axis=1)
In [317]:
test1 = test1.rename(columns={'totalorders': 'sumoforderquantity'})
In [318]:
len(test1.columns)
Out[318]:
38
In [319]:
len(train.columns)  # train has one extra column: RetentionPotential (the target)
Out[319]:
39
In [320]:
train_columns = set(train.columns)
test_columns = set(test1.columns)
missing_columns = train_columns - test_columns

print("Columns missing in test data:", missing_columns)
Columns missing in test data: {'RetentionPotential'}
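Beyond comparing column sets, pandas' `reindex` can align a test frame to the training feature list in one step, filling any missing column with a default. A sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical train/test frames with a target column and shuffled features
train = pd.DataFrame({"a": [1], "b": [2], "target": [0]})
test = pd.DataFrame({"b": [3], "a": [4]})

# Align test to train's feature columns; absent columns would be filled with 0
features = [c for c in train.columns if c != "target"]
aligned = test.reindex(columns=features, fill_value=0)
print(list(aligned.columns))
```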
In [321]:
# split into input (X) and output (y) variables
test1=test1.drop(['CustomerID','days_since_first_transaction_category', 'email_duration_category'], axis=1)
In [322]:
cat_cols
Out[322]:
Index(['City', 'OnlineCommunication', 'AutomaticRefill', 'DoorstepDelivery',
       'PreferredDeliveryDay', 'EmailType', 'MailOpened', 'MailClicked',
       'RegistrationYear', 'customer_segment', 'most_ordered_quarter',
       'Registration_month', 'Registration_weekday', 'lastemail_month',
       'lastemail_weekday', 'transaction_month', 'transaction_weekday'],
      dtype='object')
In [323]:
cat_cols=test1.select_dtypes(include=['category'])

cat_cols=cat_cols.columns
num_cols=test1.select_dtypes(include=['int64', 'float64'])
num_cols=num_cols.drop(['open_rate','click_through_rate', 'response_rate', 'open_rate_ratio','click_through_rate_ratio', 'response_rate_ratio'],axis=1)
num_cols=num_cols.columns
 ## Convert Categorical Columns to Dummies
test1 = pd.get_dummies(test1, columns=cat_cols,drop_first=True)
#scaling
scaler = StandardScaler()
# note: this fits the scaler on the test data itself; strictly, the scaler
# fitted on the training data should be reused here via scaler.transform
test1[num_cols] = scaler.fit_transform(test1[num_cols])
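For reference, the leakage-free pattern is to fit `StandardScaler` on the training data only and reuse the fitted object for the test data, so test values are scaled with the training mean and standard deviation. A minimal sketch with hypothetical one-column frames:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric frames standing in for train/test
train_num = pd.DataFrame({"x": [0.0, 10.0]})
test_num = pd.DataFrame({"x": [5.0]})

# Fit on train only, then transform test with the same scaler,
# so test is scaled using the training mean and std
scaler = StandardScaler().fit(train_num)
test_scaled = scaler.transform(test_num)
print(test_scaled)
```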
In [324]:
len(test1.columns)
Out[324]:
97
In [325]:
len(X.columns)
Out[325]:
97
In [326]:
test1
Out[326]:
numofemails average_time_between_mails OrderQuantity num_transactions sumoforderquantity average_time_between_purchase days_since_first_transaction email_duration duration_of_customer_relation recency Time_to_First_Email Time_since_Last_Email open_rate click_through_rate response_rate open_rate_ratio click_through_rate_ratio response_rate_ratio City_CITY2 City_CITY3 City_CITY4 OnlineCommunication_yes AutomaticRefill_yes DoorstepDelivery_yes PreferredDeliveryDay_Monday PreferredDeliveryDay_Thursday PreferredDeliveryDay_Tuesday PreferredDeliveryDay_Wednesday PreferredDeliveryDay_weekend EmailType_WelcomeEmail MailOpened_yes MailClicked_no MailClicked_yes customer_segment_small_order_customer most_ordered_quarter_Q2 most_ordered_quarter_Q3 Registration_month_2 Registration_month_3 Registration_month_4 Registration_month_5 Registration_month_6 Registration_month_7 Registration_month_8 Registration_month_9 Registration_month_10 Registration_month_11 Registration_month_12 RegistrationYear_2009 RegistrationYear_2010 RegistrationYear_2011 RegistrationYear_2012 RegistrationYear_2013 RegistrationYear_2014 RegistrationYear_2015 RegistrationYear_2016 RegistrationYear_2017 RegistrationYear_2018 Registration_weekday_1 Registration_weekday_2 Registration_weekday_3 Registration_weekday_4 Registration_weekday_5 Registration_weekday_6 lastemail_month_2 lastemail_month_3 lastemail_month_4 lastemail_month_5 lastemail_month_6 lastemail_month_7 lastemail_month_8 lastemail_month_9 lastemail_month_10 lastemail_month_11 lastemail_month_12 lastemail_weekday_1 lastemail_weekday_2 lastemail_weekday_3 lastemail_weekday_4 lastemail_weekday_5 lastemail_weekday_6 transaction_month_2 transaction_month_3 transaction_month_4 transaction_month_5 transaction_month_6 transaction_month_7 transaction_month_8 transaction_month_9 transaction_month_10 transaction_month_11 transaction_month_12 transaction_weekday_1 transaction_weekday_2 transaction_weekday_3 transaction_weekday_4 transaction_weekday_5 transaction_weekday_6
0 -1.002649 -0.304959 -0.159308 -0.273265 -0.194793 -0.348208 -0.388968 -0.426962 0.107045 1.566055 0.955169 1.438968 0.000000 4.007333 4.007333 0.000000 0.000000 -0.074210 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1.243485 3.622659 3.375352 12.783856 17.507358 -0.142518 4.985016 7.179561 6.510322 0.062220 5.241099 -0.041607 -0.191055 6.655813 6.464758 1.440362 1.717651 -0.000349 1 0 0 1 1 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
2 1.942282 1.407849 1.287954 9.187284 8.371074 -0.115900 4.022309 3.902087 3.558956 0.057805 0.477154 -0.031180 -0.798508 6.273645 5.475138 -0.230524 2.639057 -0.000671 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
3 1.193571 -0.197032 -0.381964 -0.273265 -0.199596 -0.348208 -0.388968 -0.222376 -0.282269 1.770588 0.096362 -0.187579 -3.806662 3.850148 0.043485 -3.806662 0.000000 -1.925074 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0
4 -0.403680 7.338791 2.735216 9.734588 11.923873 -0.104419 4.501402 3.758876 3.966808 0.065162 -0.316837 -0.198005 -0.367725 7.718093 7.350368 0.587787 1.704748 -0.000490 1 0 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5950 -1.002649 -0.304959 -0.548956 -0.273265 -0.203199 -0.348208 -0.388968 -0.426962 -0.475072 -2.093474 -0.316837 -0.198005 0.000000 3.713572 3.713572 0.000000 0.000000 -0.092839 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
5951 -1.002649 -0.304959 -0.576788 -0.273265 -0.203799 -0.348208 -0.388968 -0.426962 -0.475072 -2.093474 -0.316837 -0.198005 0.000000 3.688879 3.688879 0.000000 0.000000 -0.094587 0 1 0 1 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
5952 -1.002649 -0.304959 -0.688116 -0.273265 -0.206201 -0.348208 -0.388968 -0.426962 -0.475072 -2.093474 -0.316837 -0.198005 0.000000 3.583519 3.583519 0.000000 0.000000 -0.102386 0 1 0 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
5953 -1.002649 -0.304959 0.397331 -0.273265 -0.182784 -0.348208 -0.388968 -0.426962 -0.475072 -2.092002 -0.316837 -0.208432 0.000000 4.317488 4.317488 0.000000 0.000000 -0.058344 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
5954 -1.002649 -0.304959 -0.576788 -0.273265 -0.203799 -0.348208 -0.388968 -0.426962 -0.475072 -2.092002 -0.316837 -0.208432 0.000000 3.688879 3.688879 0.000000 0.000000 -0.094587 0 1 0 1 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0

5955 rows × 97 columns

In [327]:
test1_columns = set(test1.columns)
X_columns = set(X.columns)
missing_columns = test1_columns - X_columns

print("Columns in test1 but not in X:", missing_columns)
Columns in test1 but not in X: set()
In [328]:
test1.columns
Out[328]:
Index(['numofemails', 'average_time_between_mails', 'OrderQuantity',
       'num_transactions', 'sumoforderquantity',
       'average_time_between_purchase', 'days_since_first_transaction',
       'email_duration', 'duration_of_customer_relation', 'recency',
       'Time_to_First_Email', 'Time_since_Last_Email', 'open_rate',
       'click_through_rate', 'response_rate', 'open_rate_ratio',
       'click_through_rate_ratio', 'response_rate_ratio', 'City_CITY2',
       'City_CITY3', 'City_CITY4', 'OnlineCommunication_yes',
       'AutomaticRefill_yes', 'DoorstepDelivery_yes',
       'PreferredDeliveryDay_Monday', 'PreferredDeliveryDay_Thursday',
       'PreferredDeliveryDay_Tuesday', 'PreferredDeliveryDay_Wednesday',
       'PreferredDeliveryDay_weekend', 'EmailType_WelcomeEmail',
       'MailOpened_yes', 'MailClicked_no', 'MailClicked_yes',
       'customer_segment_small_order_customer', 'most_ordered_quarter_Q2',
       'most_ordered_quarter_Q3', 'Registration_month_2',
       'Registration_month_3', 'Registration_month_4', 'Registration_month_5',
       'Registration_month_6', 'Registration_month_7', 'Registration_month_8',
       'Registration_month_9', 'Registration_month_10',
       'Registration_month_11', 'Registration_month_12',
       'RegistrationYear_2009', 'RegistrationYear_2010',
       'RegistrationYear_2011', 'RegistrationYear_2012',
       'RegistrationYear_2013', 'RegistrationYear_2014',
       'RegistrationYear_2015', 'RegistrationYear_2016',
       'RegistrationYear_2017', 'RegistrationYear_2018',
       'Registration_weekday_1', 'Registration_weekday_2',
       'Registration_weekday_3', 'Registration_weekday_4',
       'Registration_weekday_5', 'Registration_weekday_6', 'lastemail_month_2',
       'lastemail_month_3', 'lastemail_month_4', 'lastemail_month_5',
       'lastemail_month_6', 'lastemail_month_7', 'lastemail_month_8',
       'lastemail_month_9', 'lastemail_month_10', 'lastemail_month_11',
       'lastemail_month_12', 'lastemail_weekday_1', 'lastemail_weekday_2',
       'lastemail_weekday_3', 'lastemail_weekday_4', 'lastemail_weekday_5',
       'lastemail_weekday_6', 'transaction_month_2', 'transaction_month_3',
       'transaction_month_4', 'transaction_month_5', 'transaction_month_6',
       'transaction_month_7', 'transaction_month_8', 'transaction_month_9',
       'transaction_month_10', 'transaction_month_11', 'transaction_month_12',
       'transaction_weekday_1', 'transaction_weekday_2',
       'transaction_weekday_3', 'transaction_weekday_4',
       'transaction_weekday_5', 'transaction_weekday_6'],
      dtype='object')
In [329]:
X.columns
Out[329]:
Index(['numofemails', 'average_time_between_mails', 'OrderQuantity',
       'num_transactions', 'sumoforderquantity',
       'average_time_between_purchase', 'days_since_first_transaction',
       'email_duration', 'duration_of_customer_relation', 'recency',
       'Time_to_First_Email', 'Time_since_Last_Email', 'open_rate',
       'click_through_rate', 'response_rate', 'open_rate_ratio',
       'click_through_rate_ratio', 'response_rate_ratio', 'City_CITY2',
       'City_CITY3', 'City_CITY4', 'OnlineCommunication_yes',
       'AutomaticRefill_yes', 'DoorstepDelivery_yes',
       'PreferredDeliveryDay_Monday', 'PreferredDeliveryDay_Thursday',
       'PreferredDeliveryDay_Tuesday', 'PreferredDeliveryDay_Wednesday',
       'PreferredDeliveryDay_weekend', 'EmailType_WelcomeEmail',
       'MailOpened_yes', 'MailClicked_no', 'MailClicked_yes',
       'RegistrationYear_2009', 'RegistrationYear_2010',
       'RegistrationYear_2011', 'RegistrationYear_2012',
       'RegistrationYear_2013', 'RegistrationYear_2014',
       'RegistrationYear_2015', 'RegistrationYear_2016',
       'RegistrationYear_2017', 'RegistrationYear_2018',
       'customer_segment_small_order_customer', 'most_ordered_quarter_Q2',
       'most_ordered_quarter_Q3', 'Registration_month_2',
       'Registration_month_3', 'Registration_month_4', 'Registration_month_5',
       'Registration_month_6', 'Registration_month_7', 'Registration_month_8',
       'Registration_month_9', 'Registration_month_10',
       'Registration_month_11', 'Registration_month_12',
       'Registration_weekday_1', 'Registration_weekday_2',
       'Registration_weekday_3', 'Registration_weekday_4',
       'Registration_weekday_5', 'Registration_weekday_6', 'lastemail_month_2',
       'lastemail_month_3', 'lastemail_month_4', 'lastemail_month_5',
       'lastemail_month_6', 'lastemail_month_7', 'lastemail_month_8',
       'lastemail_month_9', 'lastemail_month_10', 'lastemail_month_11',
       'lastemail_month_12', 'lastemail_weekday_1', 'lastemail_weekday_2',
       'lastemail_weekday_3', 'lastemail_weekday_4', 'lastemail_weekday_5',
       'lastemail_weekday_6', 'transaction_month_2', 'transaction_month_3',
       'transaction_month_4', 'transaction_month_5', 'transaction_month_6',
       'transaction_month_7', 'transaction_month_8', 'transaction_month_9',
       'transaction_month_10', 'transaction_month_11', 'transaction_month_12',
       'transaction_weekday_1', 'transaction_weekday_2',
       'transaction_weekday_3', 'transaction_weekday_4',
       'transaction_weekday_5', 'transaction_weekday_6'],
      dtype='object')
In [330]:
# create a list of feature names from the column names of X_train
feature_names = list(X_train.columns)

# check the feature names used during training
print("Feature names used during training:", feature_names)

# check the feature names in the test data
print("Feature names in test data:", list(test1.columns))

# reorder the columns of the test data to match the order of the feature names in feature_names
test1 = test1[feature_names]

# check the reordered feature names in the test data
print("Feature names in test data after reordering:", list(test1.columns))
Feature names used during training: ['numofemails', 'average_time_between_mails', 'OrderQuantity', 'num_transactions', 'sumoforderquantity', 'average_time_between_purchase', 'days_since_first_transaction', 'email_duration', 'duration_of_customer_relation', 'recency', 'Time_to_First_Email', 'Time_since_Last_Email', 'open_rate', 'click_through_rate', 'response_rate', 'open_rate_ratio', 'click_through_rate_ratio', 'response_rate_ratio', 'City_CITY2', 'City_CITY3', 'City_CITY4', 'OnlineCommunication_yes', 'AutomaticRefill_yes', 'DoorstepDelivery_yes', 'PreferredDeliveryDay_Monday', 'PreferredDeliveryDay_Thursday', 'PreferredDeliveryDay_Tuesday', 'PreferredDeliveryDay_Wednesday', 'PreferredDeliveryDay_weekend', 'EmailType_WelcomeEmail', 'MailOpened_yes', 'MailClicked_no', 'MailClicked_yes', 'RegistrationYear_2009', 'RegistrationYear_2010', 'RegistrationYear_2011', 'RegistrationYear_2012', 'RegistrationYear_2013', 'RegistrationYear_2014', 'RegistrationYear_2015', 'RegistrationYear_2016', 'RegistrationYear_2017', 'RegistrationYear_2018', 'customer_segment_small_order_customer', 'most_ordered_quarter_Q2', 'most_ordered_quarter_Q3', 'Registration_month_2', 'Registration_month_3', 'Registration_month_4', 'Registration_month_5', 'Registration_month_6', 'Registration_month_7', 'Registration_month_8', 'Registration_month_9', 'Registration_month_10', 'Registration_month_11', 'Registration_month_12', 'Registration_weekday_1', 'Registration_weekday_2', 'Registration_weekday_3', 'Registration_weekday_4', 'Registration_weekday_5', 'Registration_weekday_6', 'lastemail_month_2', 'lastemail_month_3', 'lastemail_month_4', 'lastemail_month_5', 'lastemail_month_6', 'lastemail_month_7', 'lastemail_month_8', 'lastemail_month_9', 'lastemail_month_10', 'lastemail_month_11', 'lastemail_month_12', 'lastemail_weekday_1', 'lastemail_weekday_2', 'lastemail_weekday_3', 'lastemail_weekday_4', 'lastemail_weekday_5', 'lastemail_weekday_6', 'transaction_month_2', 'transaction_month_3', 
'transaction_month_4', 'transaction_month_5', 'transaction_month_6', 'transaction_month_7', 'transaction_month_8', 'transaction_month_9', 'transaction_month_10', 'transaction_month_11', 'transaction_month_12', 'transaction_weekday_1', 'transaction_weekday_2', 'transaction_weekday_3', 'transaction_weekday_4', 'transaction_weekday_5', 'transaction_weekday_6']
Feature names in test data: ['numofemails', 'average_time_between_mails', 'OrderQuantity', 'num_transactions', 'sumoforderquantity', 'average_time_between_purchase', 'days_since_first_transaction', 'email_duration', 'duration_of_customer_relation', 'recency', 'Time_to_First_Email', 'Time_since_Last_Email', 'open_rate', 'click_through_rate', 'response_rate', 'open_rate_ratio', 'click_through_rate_ratio', 'response_rate_ratio', 'City_CITY2', 'City_CITY3', 'City_CITY4', 'OnlineCommunication_yes', 'AutomaticRefill_yes', 'DoorstepDelivery_yes', 'PreferredDeliveryDay_Monday', 'PreferredDeliveryDay_Thursday', 'PreferredDeliveryDay_Tuesday', 'PreferredDeliveryDay_Wednesday', 'PreferredDeliveryDay_weekend', 'EmailType_WelcomeEmail', 'MailOpened_yes', 'MailClicked_no', 'MailClicked_yes', 'customer_segment_small_order_customer', 'most_ordered_quarter_Q2', 'most_ordered_quarter_Q3', 'Registration_month_2', 'Registration_month_3', 'Registration_month_4', 'Registration_month_5', 'Registration_month_6', 'Registration_month_7', 'Registration_month_8', 'Registration_month_9', 'Registration_month_10', 'Registration_month_11', 'Registration_month_12', 'RegistrationYear_2009', 'RegistrationYear_2010', 'RegistrationYear_2011', 'RegistrationYear_2012', 'RegistrationYear_2013', 'RegistrationYear_2014', 'RegistrationYear_2015', 'RegistrationYear_2016', 'RegistrationYear_2017', 'RegistrationYear_2018', 'Registration_weekday_1', 'Registration_weekday_2', 'Registration_weekday_3', 'Registration_weekday_4', 'Registration_weekday_5', 'Registration_weekday_6', 'lastemail_month_2', 'lastemail_month_3', 'lastemail_month_4', 'lastemail_month_5', 'lastemail_month_6', 'lastemail_month_7', 'lastemail_month_8', 'lastemail_month_9', 'lastemail_month_10', 'lastemail_month_11', 'lastemail_month_12', 'lastemail_weekday_1', 'lastemail_weekday_2', 'lastemail_weekday_3', 'lastemail_weekday_4', 'lastemail_weekday_5', 'lastemail_weekday_6', 'transaction_month_2', 'transaction_month_3', 'transaction_month_4', 
'transaction_month_5', 'transaction_month_6', 'transaction_month_7', 'transaction_month_8', 'transaction_month_9', 'transaction_month_10', 'transaction_month_11', 'transaction_month_12', 'transaction_weekday_1', 'transaction_weekday_2', 'transaction_weekday_3', 'transaction_weekday_4', 'transaction_weekday_5', 'transaction_weekday_6']
Feature names in test data after reordering: ['numofemails', 'average_time_between_mails', 'OrderQuantity', 'num_transactions', 'sumoforderquantity', 'average_time_between_purchase', 'days_since_first_transaction', 'email_duration', 'duration_of_customer_relation', 'recency', 'Time_to_First_Email', 'Time_since_Last_Email', 'open_rate', 'click_through_rate', 'response_rate', 'open_rate_ratio', 'click_through_rate_ratio', 'response_rate_ratio', 'City_CITY2', 'City_CITY3', 'City_CITY4', 'OnlineCommunication_yes', 'AutomaticRefill_yes', 'DoorstepDelivery_yes', 'PreferredDeliveryDay_Monday', 'PreferredDeliveryDay_Thursday', 'PreferredDeliveryDay_Tuesday', 'PreferredDeliveryDay_Wednesday', 'PreferredDeliveryDay_weekend', 'EmailType_WelcomeEmail', 'MailOpened_yes', 'MailClicked_no', 'MailClicked_yes', 'RegistrationYear_2009', 'RegistrationYear_2010', 'RegistrationYear_2011', 'RegistrationYear_2012', 'RegistrationYear_2013', 'RegistrationYear_2014', 'RegistrationYear_2015', 'RegistrationYear_2016', 'RegistrationYear_2017', 'RegistrationYear_2018', 'customer_segment_small_order_customer', 'most_ordered_quarter_Q2', 'most_ordered_quarter_Q3', 'Registration_month_2', 'Registration_month_3', 'Registration_month_4', 'Registration_month_5', 'Registration_month_6', 'Registration_month_7', 'Registration_month_8', 'Registration_month_9', 'Registration_month_10', 'Registration_month_11', 'Registration_month_12', 'Registration_weekday_1', 'Registration_weekday_2', 'Registration_weekday_3', 'Registration_weekday_4', 'Registration_weekday_5', 'Registration_weekday_6', 'lastemail_month_2', 'lastemail_month_3', 'lastemail_month_4', 'lastemail_month_5', 'lastemail_month_6', 'lastemail_month_7', 'lastemail_month_8', 'lastemail_month_9', 'lastemail_month_10', 'lastemail_month_11', 'lastemail_month_12', 'lastemail_weekday_1', 'lastemail_weekday_2', 'lastemail_weekday_3', 'lastemail_weekday_4', 'lastemail_weekday_5', 'lastemail_weekday_6', 'transaction_month_2', 'transaction_month_3', 
'transaction_month_4', 'transaction_month_5', 'transaction_month_6', 'transaction_month_7', 'transaction_month_8', 'transaction_month_9', 'transaction_month_10', 'transaction_month_11', 'transaction_month_12', 'transaction_weekday_1', 'transaction_weekday_2', 'transaction_weekday_3', 'transaction_weekday_4', 'transaction_weekday_5', 'transaction_weekday_6']
In [331]:
# make predictions on the test data
y_pred = clf_dt_grid.predict(test1)
print("Predictions:", y_pred)
Predictions: [0 1 1 ... 0 0 0]
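The reordering step above matters because scikit-learn models fitted on a DataFrame see features positionally; if the test columns arrive in a different order, predictions are silently wrong. A minimal sketch of that alignment with `DataFrame.reindex` (the column names here are toy stand-ins, not the notebook's full feature list):

```python
import pandas as pd

# Toy sketch: align test columns to the training feature order before predict.
# 'train_features' stands in for the column order the model was fitted on.
train_features = ["recency", "num_transactions", "OrderQuantity"]
test_df = pd.DataFrame({"OrderQuantity": [5], "recency": [38], "num_transactions": [2]})

# reindex reorders columns; any missing column would become all-NaN, so check first
missing = set(train_features) - set(test_df.columns)
assert not missing, f"test data is missing: {missing}"
aligned = test_df.reindex(columns=train_features)
print(list(aligned.columns))  # ['recency', 'num_transactions', 'OrderQuantity']
```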
In [332]:
csv = pd.read_excel("/content/Test.xlsx")  # reload the raw test file to attach predictions to it
In [333]:
train.columns
Out[333]:
Index(['CustomerID', 'City', 'OnlineCommunication', 'AutomaticRefill',
       'DoorstepDelivery', 'PreferredDeliveryDay', 'RetentionPotential',
       'EmailType', 'MailOpened', 'MailClicked', 'numofemails',
       'average_time_between_mails', 'OrderQuantity', 'num_transactions',
       'sumoforderquantity', 'average_time_between_purchase',
       'days_since_first_transaction', 'email_duration',
       'duration_of_customer_relation', 'recency',
       'days_since_first_transaction_category', 'email_duration_category',
       'RegistrationYear', 'customer_segment', 'Time_to_First_Email',
       'Time_since_Last_Email', 'most_ordered_quarter', 'Registration_month',
       'Registration_weekday', 'lastemail_month', 'lastemail_weekday',
       'transaction_month', 'transaction_weekday', 'open_rate',
       'click_through_rate', 'response_rate', 'open_rate_ratio',
       'click_through_rate_ratio', 'response_rate_ratio'],
      dtype='object')
In [334]:
csv["RetentionPotential"]=y_pred
In [335]:
csv["RetentionPotential"].value_counts()
Out[335]:
1    3093
0    2414
2     448
Name: RetentionPotential, dtype: int64

12)CUSTOMER SEGMENTATION FOR TEST DATA¶

In [336]:
# RFM segmentation
# Segment customers based on their transaction records.
# The last transaction in the data is on 2018-01-21.
transaction["DateOfOrder"].describe()
Out[336]:
count                   26768
unique                   2771
top       2014-01-13 00:00:00
freq                       82
first     2008-08-05 00:00:00
last      2018-01-21 00:00:00
Name: DateOfOrder, dtype: object
In [337]:
# Compute, for each record, the days between the order date and the last observed transaction date (2018-01-21).
transaction['days_since'] =  pd.to_datetime(transaction['DateOfOrder']).apply(lambda x: (pd.to_datetime("2018-01-21") - x).days)
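A quick toy check of that `days_since` recipe, using two invented order dates against the 2018-01-21 anchor from the data:

```python
import pandas as pd

# Toy verification: days from each order date to the last observed
# transaction date (2018-01-21 in the source data).
orders = pd.Series(pd.to_datetime(["2018-01-11", "2017-01-21"]))
anchor = pd.to_datetime("2018-01-21")
days_since = (anchor - orders).dt.days
print(days_since.tolist())  # [10, 365]
```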
In [338]:
## Compute recency, frequency, and order value per customer
def rec_freq_value(x):
    recency        = x['days_since'].min()     # days since the most recent order
    first_purchase = x['days_since'].max()     # days since the first order (largest gap)
    frequency      = x['days_since'].count()   # number of transactions
    order_value    = x['OrderQuantity'].mean() # average quantity per order
    c = ['recency', 'first_purchase', 'frequency', 'order_value']
    return pd.Series([recency, first_purchase, frequency, order_value], index = c)
In [339]:
# Apply the function to each customer's transactions
RFM = transaction.groupby('CustomerID').apply(rec_freq_value)
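To make the groupby-apply behaviour concrete, here is the same RFM logic run on a two-customer toy frame (data invented for illustration only):

```python
import pandas as pd

# Toy check of the RFM logic: customer A has two orders, customer B has one.
tx = pd.DataFrame({
    "CustomerID": ["A", "A", "B"],
    "days_since": [10, 100, 5],
    "OrderQuantity": [2, 4, 7],
})

def rec_freq_value(x):
    return pd.Series({
        "recency": x["days_since"].min(),         # most recent order
        "first_purchase": x["days_since"].max(),  # oldest order
        "frequency": x["days_since"].count(),     # number of orders
        "order_value": x["OrderQuantity"].mean(), # mean quantity per order
    })

rfm = tx.groupby("CustomerID").apply(rec_freq_value)
# customer A: recency 10, first purchase 100 days ago, 2 orders, mean qty 3.0
print(rfm.loc["A"].to_dict())
```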
In [340]:
# Combine email behaviour with the RFM features
emailresponses=test[['CustomerID','Totalmails-notoponed', 'Totalmails-oponed', 'timesclicked-yes', 'timesclicked-no','numofemails']]
In [341]:
# Combine RFM scores and email response data
rfm_email = pd.merge(RFM,emailresponses, on=["CustomerID"])
rfm_email=rfm_email.drop(["CustomerID"],axis=1)

# Normalize the RFM scores and email response data
rfm_email_norm = (rfm_email - rfm_email.mean()) / rfm_email.std()
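One caveat worth noting: the line above z-scores the test features with the *test set's own* mean and std. If the k-means model was fitted on training data normalized with training statistics, reusing those same statistics keeps both point clouds on one scale. A hedged sketch of that idea (the `train`/`test` frames here are invented placeholders):

```python
import pandas as pd

# Hypothetical sketch: reuse the *training* mean/std when normalizing test
# features, so the fitted k-means centroids and the test points share a scale.
train = pd.DataFrame({"recency": [10.0, 20.0, 30.0], "frequency": [1.0, 3.0, 5.0]})
test = pd.DataFrame({"recency": [20.0], "frequency": [3.0]})

train_mean, train_std = train.mean(), train.std()
train_norm = (train - train_mean) / train_std
test_norm = (test - train_mean) / train_std  # training stats, not test's own

# a test point equal to the training mean maps to exactly 0
print(test_norm.to_dict("records"))
```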
In [342]:
test_labels=KMean_clust.predict(rfm_email_norm)
In [343]:
# Attach the predicted cluster label to each observation
rfm_email['Cluster'] = test_labels
rfm_email.head()
Out[343]:
recency first_purchase frequency order_value Totalmails-notoponed Totalmails-oponed timesclicked-yes timesclicked-no numofemails Cluster
0 38.0 38.0 1.0 52.0 16 9 21 3 24 0
1 1554.0 1554.0 1.0 40.0 0 1 0 0 1 4
2 1586.0 1586.0 1.0 40.0 0 1 0 0 1 4
3 11.0 11.0 1.0 41.0 21 9 24 5 29 0
4 1555.0 1555.0 1.0 79.0 38 4 40 1 41 1
In [344]:
from matplotlib import pyplot as plt

## Scatter plot: frequency vs recency, coloured by cluster
# (DataFrame.plot creates its own figure, so no separate plt.figure() is needed)
Colors = ["red", "green", "blue", "black", "yellow"]
rfm_email['Color'] = rfm_email['Cluster'].map(lambda p: Colors[p])
ax = rfm_email.plot(
    kind="scatter",
    x="recency", y="frequency",
    figsize=(10,8),
    c=rfm_email['Color']
)
In [345]:
ax = rfm_email.plot(    
    kind="scatter", 
    x="order_value", y="frequency",
    figsize=(10,8),
    c = rfm_email['Color']
)
In [346]:
rfm_email['Cluster Name'] = ''
rfm_email.loc[rfm_email['Cluster']==0, 'Cluster Name'] = 'midfrequency'
rfm_email.loc[rfm_email['Cluster']==1, 'Cluster Name'] = 'Higherrecency-lessfrequent'
rfm_email.loc[rfm_email['Cluster']==2, 'Cluster Name'] = 'recenttransactionswithlessfrequency'
rfm_email.loc[rfm_email['Cluster']==3, 'Cluster Name'] = 'midfrequencywithrecenttransactionatintervals'
rfm_email.loc[rfm_email['Cluster']==4, 'Cluster Name'] = 'higherfrequencyatdifferentintervals'
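The five assignments above can also be expressed as a single `Series.map` over a dictionary, which avoids chained indexing (the pattern pandas flags with `SettingWithCopyWarning`). A small self-contained sketch on toy cluster ids:

```python
import pandas as pd

# Map cluster ids to names in one step instead of five masked assignments.
cluster_names = {
    0: "midfrequency",
    1: "Higherrecency-lessfrequent",
    2: "recenttransactionswithlessfrequency",
    3: "midfrequencywithrecenttransactionatintervals",
    4: "higherfrequencyatdifferentintervals",
}
df = pd.DataFrame({"Cluster": [0, 4, 1]})
df["Cluster Name"] = df["Cluster"].map(cluster_names)
print(df["Cluster Name"].tolist())
```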
In [347]:
csv = csv.reset_index().drop('index', axis=1)
csv['Cluster Name'] = rfm_email['Cluster Name']
csv['Cluster Name']=csv['Cluster Name'].astype("category")
In [348]:
csv
Out[348]:
CustomerID RetentionPotential Cluster Name
0 C12121 0 midfrequency
1 C12132 1 higherfrequencyatdifferentintervals
2 C12134 1 higherfrequencyatdifferentintervals
3 C12135 1 midfrequency
4 C12136 2 Higherrecency-lessfrequent
... ... ... ...
5950 C42153 0 higherfrequencyatdifferentintervals
5951 C42155 0 higherfrequencyatdifferentintervals
5952 C42157 0 higherfrequencyatdifferentintervals
5953 C42167 0 Higherrecency-lessfrequent
5954 C42190 0 higherfrequencyatdifferentintervals

5955 rows × 3 columns

13)Final CSV file¶

In [349]:
# save dataframe to a csv file
csv.to_csv('my_final_predictions-withclusters.csv', index=False)